Re: Can not find solr core on admin page after setup

2013-10-30 Thread engy.morsy
yes, I do. I installed the solr example instance.

Engy.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-not-find-solr-core-on-admin-page-after-setup-tp4098236p4098380.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can not find solr core on admin page after setup

2013-10-30 Thread Bayu Widyasanyata
Hi Engy,

Have you copied Solr's war (e.g. solr-4.5.1.war for the latest Solr
distribution) from the Solr source distribution to Tomcat's webapps directory
(renaming it to solr.war in the webapps dir.)?

After putting that file in place and restarting Tomcat, it will create a 'solr'
folder under webapps.

Or, if you still find no Admin page, please check the Tomcat log (catalina.out).

Thanks.-


On Tue, Oct 29, 2013 at 8:54 PM, engy.morsy engy.mo...@bibalex.org wrote:

 Hi,

 I set up Solr 4.2 under Apache Tomcat on a Windows machine. I created a solr.xml
 under catalina/localhost that holds the solr/home path. I have only one core, so
 the solr.xml under the solr instance looks like:

 <cores adminPath="/admin/cores" defaultCoreName="core0">
   <core name="core0" instanceDir="core0" />
 </cores>

 After starting the Apache service, I did not find the core on the admin
 page. I checked the logs but no errors were found. I checked that the data
 folder was created successfully. I am not even able to access the core
 directly. Any ideas?

 Thanks
 Engy



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Can-not-find-solr-core-on-admin-page-after-setup-tp4098236.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
wassalam,
[bayu]


Re: Can not find solr core on admin page after setup

2013-10-30 Thread engy.morsy
Hi Bayu,

I did that, but for Solr 4.2; catalina.out has no exceptions at all.

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-not-find-solr-core-on-admin-page-after-setup-tp4098236p4098385.html
Sent from the Solr - User mailing list archive at Nabble.com.


SolrCloud full index replication on leader failure

2013-10-30 Thread Alejandro Marqués Rodríguez
Hi,

I have a problem with SolrCloud in a specific test case and I wanted to
know whether this is the way it should work or if there is any way to avoid it...

I have the following scenario:

- Three machines
- Each one with one zookeeper and one solr 4.1.0
- Each Solr stores 7 Million documents and the index is 2GB

The test consists of sending queries to Solr (100 concurrent queries
continuously) and then forcing a leader failure by shutting down both
ZooKeeper and Solr.

When we shut down any Solr that is not the leader there are no problems;
the other two keep responding to queries without issues. However, if we shut
down the leader, the following happens:

- Both Solrs continue responding to queries until the leader election
starts
- One of them is elected leader and the other one stops responding to
queries (I've read it goes into recovery mode until its index is synchronized
with the leader's)
- Then, even though both indexes are the same (they were synchronized
before the leader failure), the whole index is replicated.
- While the 2GB are replicated from the leader to the remaining
server, the recovering server is not responding to queries, so the
leader must handle the whole query load and finally crashes
due to having too many queries to answer (aside from replicating its index)

My question here is... Is it normal that the whole index is replicated on a
leader change even though the leader's and the other Solr's indexes should be
the same? Is there any way to avoid it? Maybe I have some configuration
wrong? Would upgrading Solr to 4.5.x avoid this behaviour?

Aside from this problem everything seems to work fine, but that point of
failure is too risky for us.

Thanks in advance


-- 
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread Toke Eskildsen
On Tue, 2013-10-29 at 14:24 +0100, eShard wrote:
 I have a 1 TB repository with approximately 500,000 documents (that will
 probably grow from there) that needs to be indexed.  

As Shawn points out, that isn't telling us much. If you describe the
documents, how and how often you index, and how you query them, it will
help a lot.


Let me offer some observations from a related project we are starting at
Statsbiblioteket.


We are planning to index 20 TB of harvested web resources (*.dk from the
last 8 years, or at least the resources our crawlers sunk their
tentacles into). We have two text indexes generated from about 1% and 2%
of that corpus, respectively. They are 200GB and 420GB in size and
contain ~75 million and (whoops, offline, so remember-guessing here)
~150 million documents.

For testing purposes we issued simple searches: 2-4 OR'ed terms, picked
at random from a Danish dictionary. One of our test machines is a 2*8
core Xeon machine with 32GB of RAM (about ~12GB free for caching) and
SSD as storage. We had room for a 2-shard cloud on the SSDs, so
searches were issued against 2*200GB indexes totalling 150 million
documents. CentOS/Solr 4.3.

Hammering that machine with 32 threads gave us a median response time of
200ms and a 99th percentile of 5-800 ms (depending on test run); a single
thread had a median of 30ms and a 99th percentile of 70-130ms. CPU load peaked at
300-400% and IOWait at 30-40%, but was not closely monitored.

Our current vision is to shard the projected 20TB index into ~800GB or
~1TB chunks (depending on which drives we choose) and put one shard on
each physical SSD, thereby sidestepping the whole RAID & TRIM problem.

We do have the great luxury of running nightly batch index updates on a
single shard instead of continuous updates. We would probably go for
smaller shards if they were all updated continuously.

Projected prices for the full setup range from $50,000-$100,000,
depending on where we land on the off-the-shelf to enterprise scale.

(I need to write a blog post on this)


With that in mind, I urge you to do some testing on a machine with SSDs
and modest memory vs. a traditional machine with spinning drives and
monster memory.


- Toke Eskildsen, State and University Library, Denmark




Return the synonyms as part of Solr response

2013-10-30 Thread sivaprasad
Hi, 
We have a requirement where we need to send the matched synonyms as part of
Solr response. 

Do we need to customize the Solr response handler to do this?

Regards,
Siva



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Return-the-synonyms-as-part-of-Solr-response-tp4098389.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud liveness problems

2013-10-30 Thread Jeroen Steggink

Hi,

I experience the same problem, using version 4.4.0.
In my case:
2 Solr nodes - 4 collections, each with 1 shard and 2 replicas.
3 Zookeepers

Replicas can end up with state=down when a connection to ZooKeeper is lost. 
However, there are 2 more ZooKeeper servers, so this shouldn't be a 
problem, right?


The only errors in the log look like the following:
Error inspecting tlog 
tlog{file=/opt/solr/server/blabla/replica1/data/tlog/tlog.0001106 
refcount=2}
The funny thing is, the replicas with the error work just fine; the ones 
without errors are causing problems.
Maybe because the replicas with this error go through the recovery 
process and the others do not?


There seems to be absolutely no problem with the replicas that are down. The 
only dirty hack to fix things is editing clusterstate.json and 
changing the state from down to active.

Doesn't seem right, but it does work.

Jeroen

On 18-9-2013 5:50, Mark Miller wrote:

SOLR-5243 and SOLR-5240 will likely improve the situation. Both fixes are in 
4.5 - the first RC for 4.5 will likely come tomorrow.

Thanks to yonik for sussing these out.

- Mark

On Sep 17, 2013, at 2:43 PM, Mark Miller markrmil...@gmail.com wrote:


On Sep 17, 2013, at 12:00 PM, Vladimir Veljkovic 
vladimir.veljko...@boxalino.com wrote:


Hello there,

we have following setup:

SolrCloud 4.4.0 (3 nodes, physical machines)
Zookeeper 3.4.5 (3 nodes, physical machines)

We have a number of rather small collections (~10K or ~100K documents) that 
we would like to load onto all Solr instances (numShards=1, 
replication_factor=3) and access through the local network interface, as the 
load balancing is done in layers above.

We can live (and actually do, in the test phase) with updating entire 
collections whenever we need to, switching collection aliases and removing the 
old collections.

We stumbled across the following problem: as soon as all three Solr nodes become a 
leader of at least one collection, restarting any node makes it completely 
unresponsive (timeouts), both through the admin interface and for replication. If we 
restart all Solr nodes the cluster ends up in some kind of deadlock, and the only 
remedy we found is a clean Solr installation, removing the ZooKeeper data and 
re-posting the collections.

Apparently, the leader is waiting for replicas to come up and they try to 
synchronize but time out on HTTP requests, so everything ends up in some kind of 
deadlock, maybe related to:

https://issues.apache.org/jira/browse/SOLR-5240

Yup, that sounds exactly like what you would expect with SOLR-5240. A fix for that 
is coming in 4.5, which is probably a week or so away.


Eventually (after a few minutes), the leader takes over and marks the collections 
active, but it remains blocked on the HTTP interface, so other nodes cannot synchronize.

In further tests, we loaded 4 collections with numShards=1 and 
replication_factor=2. By chance, one node became the leader for all 4 
collections. Restarting the node which was not the leader worked without 
problems, but when we restarted the leader the following happened:
- the leader shut down, and the other nodes became leaders of 2 collections each
- the leader started up, 3 collections on it became active, one collection remained 
"down", and the node became unresponsive and timed out on HTTP requests.

Hard to say - I'll experiment with 4.5 and see if I can duplicate this.

- Mark


As this behavior is completely unexpected for a cluster solution, I wonder whether 
somebody else has experienced the same problems or we are doing something entirely 
wrong.

Best regards

--

Vladimir Veljkovic
Senior Java Developer

Boxalino AG

vladimir.veljko...@boxalino.com
www.boxalino.com


Tuning Kit for your Online Shop

Product Search - Recommendations - Landing Pages - Data intelligence - Mobile 
Commerce







Making a Web Request is failing with 403 Request Forbidden

2013-10-30 Thread Vineet Mishra
Hi All,

I am making a web server call to a website for shortening links, namely
bit.ly, but I am receiving a 403 Request Forbidden.
However, if I use their web page to shorten the link, it works fine.
Can anybody tell me what might be the reason for such a vague behavior?
Here is the code included.

import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.ProtocolException;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

String url = "https://bitly.com/shorten/";
StringBuffer response;
try {
    URL obj = new URL(url);
    HttpsURLConnection con = (HttpsURLConnection) obj.openConnection();

    // add request headers
    con.setRequestMethod("POST");
    con.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) "
            + "AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25.0.1364.160 "
            + "Chrome/25.0.1364.160 Safari/537.22");
    con.setRequestProperty("Accept-Language", "en-US,en;q=0.8");
    con.setRequestProperty("Accept",
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    con.setRequestProperty("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.3");
    con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
    con.setRequestProperty("Host", "bitly.com");

    String urlParameters = "url=http://bit.ly/1f3aLrP&ie=utf-8&oe=utf-8&gws_rd=cr"
            + "&ei=sKlwUvPbN8j-rAf-5IDwAQ&basic_style=1&classic_mode=&rapid_shorten_mode="
            + "&_xsrf=a2b71eaf499c4690a77a21d3c87e6302";

    // send the POST request
    con.setDoOutput(true);
    DataOutputStream wr = new DataOutputStream(con.getOutputStream());
    wr.writeBytes(urlParameters);
    wr.flush();
    wr.close();

    int responseCode = con.getResponseCode();
    System.out.println("Response Code : " + responseCode);

    BufferedReader in = new BufferedReader(
            new InputStreamReader(con.getInputStream()));
    String inputLine;
    response = new StringBuffer();

    while ((inputLine = in.readLine()) != null) {
        response.append(inputLine);
    }
    in.close();
    System.out.println(response.toString());
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (ProtocolException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

Hoping for your response.

Thanks!


Re: Making a Web Request is failing with 403 Request Forbidden

2013-10-30 Thread Alexandre Rafalovitch
On Wed, Oct 30, 2013 at 4:50 PM, Vineet Mishra clearmido...@gmail.comwrote:


 I am making web server call to a website for Shortening the links, that is
 bit.ly but recieving a 403 Request Forbidden.
 Although if I use their webpage to short the web link its working good.
 Can any body tell me what might be the reason for such a vague behavior.



This does not seem to be a Solr question. Perhaps look at more generic web
request tracing tools like Wireshark, etc., to compare a valid and a failing
request.

If this is Solr related, please narrow this down to Solr aspect of the
problem.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread Toke Eskildsen
On Tue, 2013-10-29 at 16:41 +0100, Shawn Heisey wrote:
 If you put the index on SSD, you could get by with less RAM, but a RAID
 solution that works properly with SSD (TRIM support) is hard to find, so
 SSD failure in most situations effectively means a server failure.  Solr
 and Lucene have a track record of shredding SSD into failure, because
 typically there is a LOT of writing involved.

Why would TRIM have any influence on whether or not a drive failure
also means server failure?


If the track record you are referring to involves the problems that the
Jenkins server for Lucene development had, I know of two failed drives
from that setup and they were both OCZ.

No surprise here, it pays to examine the reliability of the different
models before buying. My current rule is to avoid OCZ like the plague
and go for a Samsung 840 or an Intel drive.

http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923.html

- Toke Eskildsen, State and University Library, Denmark



Atomic Updates in SOLR

2013-10-30 Thread Anupam Bhattacharya
I am working on an offline tagging capability to tag records with a
thesaurus dictionary of key concepts. I am able to use the update="add"
option in XML and JSON update calls for a field to update specific
document field information. However, if I run the same atomic update query
twice, then the multivalued string fields start showing a duplicate value in
the multivalued field.
e.g. for a field named "tag" that initially had copper, iron,
steel:
after running the atomic update query with <field name="tag"
update="add">steel</field> I will get the tag field values as:
copper, iron, steel, steel (thus steel gets added twice).
I looked at RemoveDuplicatesTokenFilterFactory, but it helps to remove token
duplicates, not multivalued field duplicates. Is there any updateProcessor to
stop the incoming duplicate value from being indexed?

Thanks in advance for any help.

Regards
Anupam
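
For reference, the same atomic "add" expressed through SolrJ looks roughly like
the sketch below. It is a minimal, hypothetical example: the core URL and the
document id are made up, and only the field name and value come from the example
above.

import java.util.Collections;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicAddExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL and document id.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        // The {"add": value} map is what turns this into an atomic append
        // to the multivalued "tag" field instead of a full document overwrite.
        doc.addField("tag", Collections.singletonMap("add", "steel"));

        server.add(doc);
        server.commit();
        server.shutdown();
    }
}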


query with colon in bq

2013-10-30 Thread jihyun suh
I have a question about query with colon in bq.
Actually I use edismax and I set the q and bq just like this,

.../select?defType=edismax&q=1:100^100 1 100^30&qf=Title^2.0
Body&bq=Title:(1:100)^6.0 Body:(1:100)^6.0

With this query, I got an error on the bq: undefined field 1.
How do I use a query with a colon in bq?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/query-with-colon-in-bq-tp4098400.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Atomic Updates in SOLR

2013-10-30 Thread Shalin Shekhar Mangar
Perhaps you are running the update request more than once accidentally?

Can you try using optimistic update with _version_ while sending the
update? This way, if some part of your code is making a duplicate request
then Solr would throw an error.

See
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
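
A rough SolrJ sketch of that flow (the core URL, document id and field name are
placeholders, not from the original message): read the document's current
_version_, send it back with the update, and Solr rejects the request with a
conflict error if the document changed in between.

import java.util.Collections;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class OptimisticUpdateExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Fetch the document's current _version_ (id "doc1" is hypothetical).
        SolrDocument current = server.query(new SolrQuery("id:doc1")).getResults().get(0);
        Long version = (Long) current.getFieldValue("_version_");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        // The update is accepted only if the stored _version_ still matches;
        // a stale or duplicate request fails instead of silently appending again.
        doc.addField("_version_", version);
        doc.addField("tag", Collections.singletonMap("add", "steel"));

        server.add(doc);
        server.commit();
        server.shutdown();
    }
}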


On Wed, Oct 30, 2013 at 3:35 PM, Anupam Bhattacharya anupam...@gmail.comwrote:

 I am working on a offline tagging capability to tag records with a
 thesaurus dictionary of key concepts. I am able to use the update=add
 option using xml and json update calls for a field to update specific
 document field information. Although if I run the same atomic update query
 twice then the multivalued string fields start showing duplicate value in
 the multivalued field.
 e.g. for a field name as tag at the initial it was having copper, iron,
 steel
 After running the atomic update query with <field name="tag"
 update="add">steel</field> I will get the tag field values as following:
 copper, iron, steel, steel. (Thus steel get added twice).
 I looked at RemoveDuplicatesTokenFilterFactory but it helps to remove token
 duplicate not multivalued field duplicates. Is there any updateProcessor to
 stop the incoming duplicate value from indexing ?

 Thanks in advance for any help.

 Regards
 Anupam




-- 
Regards,
Shalin Shekhar Mangar.


Re: Atomic Updates in SOLR

2013-10-30 Thread Anshum Gupta
I am not sure if optimistic concurrency would help in deduplicating but
yes, as Shalin points out, you'll be able to spot issues with your client
code.




On Wed, Oct 30, 2013 at 4:18 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Perhaps you are running the update request more than once accidentally?

 Can you try using optimistic update with _version_ while sending the
 update? This way, if some part of your code is making a duplicate request
 then Solr would throw an error.

 See

 https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents


 On Wed, Oct 30, 2013 at 3:35 PM, Anupam Bhattacharya anupam...@gmail.com
 wrote:

  I am working on a offline tagging capability to tag records with a
  thesaurus dictionary of key concepts. I am able to use the update=add
  option using xml and json update calls for a field to update specific
  document field information. Although if I run the same atomic update
 query
  twice then the multivalued string fields start showing duplicate value in
  the multivalued field.
  e.g. for a field name as tag at the initial it was having copper, iron,
  steel
  After running the atomic update query with <field name="tag"
  update="add">steel</field> I will get the tag field values as following:
  copper, iron, steel, steel. (Thus steel get added twice).
  I looked at RemoveDuplicatesTokenFilterFactory but it helps to remove
 token
  duplicate not multivalued field duplicates. Is there any updateProcessor
 to
  stop the incoming duplicate value from indexing ?
 
  Thanks in advance for any help.
 
  Regards
  Anupam
 



 --
 Regards,
 Shalin Shekhar Mangar.




-- 

Anshum Gupta
http://www.anshumgupta.net


Re: Atomic Updates in SOLR

2013-10-30 Thread Shalin Shekhar Mangar
Ah I misread your email. You are actually sending the update twice and
asking about how to dedup the multi-valued field values.

No I don't think we have an update processor which can do that.


On Wed, Oct 30, 2013 at 4:18 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Perhaps you are running the update request more than once accidentally?

 Can you try using optimistic update with _version_ while sending the
 update? This way, if some part of your code is making a duplicate request
 then Solr would throw an error.

 See
 https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents


 On Wed, Oct 30, 2013 at 3:35 PM, Anupam Bhattacharya 
 anupam...@gmail.comwrote:

 I am working on a offline tagging capability to tag records with a
 thesaurus dictionary of key concepts. I am able to use the update=add
 option using xml and json update calls for a field to update specific
 document field information. Although if I run the same atomic update query
 twice then the multivalued string fields start showing duplicate value in
 the multivalued field.
 e.g. for a field name as tag at the initial it was having copper, iron,
 steel
 After running the atomic update query with <field name="tag"
 update="add">steel</field> I will get the tag field values as following:
 copper, iron, steel, steel. (Thus steel get added twice).
 I looked at RemoveDuplicatesTokenFilterFactory but it helps to remove
 token
 duplicate not multivalued field duplicates. Is there any updateProcessor
 to
 stop the incoming duplicate value from indexing ?

 Thanks in advance for any help.

 Regards
 Anupam




 --
 Regards,
 Shalin Shekhar Mangar.




-- 
Regards,
Shalin Shekhar Mangar.


Re: Atomic Updates in SOLR

2013-10-30 Thread Anshum Gupta
Think it'll be a good thing to have.
I just created a JIRA for that.
https://issues.apache.org/jira/browse/SOLR-5403

Will try and get to it soon.


On Wed, Oct 30, 2013 at 4:28 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Ah I misread your email. You are actually sending the update twice and
 asking about how to dedup the multi-valued field values.

 No I don't think we have an update processor which can do that.


 On Wed, Oct 30, 2013 at 4:18 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

  Perhaps you are running the update request more than once accidentally?
 
  Can you try using optimistic update with _version_ while sending the
  update? This way, if some part of your code is making a duplicate request
  then Solr would throw an error.
 
  See
 
 https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
 
 
  On Wed, Oct 30, 2013 at 3:35 PM, Anupam Bhattacharya 
 anupam...@gmail.comwrote:
 
  I am working on a offline tagging capability to tag records with a
  thesaurus dictionary of key concepts. I am able to use the update=add
  option using xml and json update calls for a field to update specific
  document field information. Although if I run the same atomic update
 query
  twice then the multivalued string fields start showing duplicate value
 in
  the multivalued field.
  e.g. for a field name as tag at the initial it was having copper, iron,
  steel
  After running the atomic update query with <field name="tag"
  update="add">steel</field> I will get the tag field values as following:
  copper, iron, steel, steel. (Thus steel get added twice).
  I looked at RemoveDuplicatesTokenFilterFactory but it helps to remove
  token
  duplicate not multivalued field duplicates. Is there any updateProcessor
  to
  stop the incoming duplicate value from indexing ?
 
  Thanks in advance for any help.
 
  Regards
  Anupam
 
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 



 --
 Regards,
 Shalin Shekhar Mangar.




-- 

Anshum Gupta
http://www.anshumgupta.net


Re: Background merge errors with Solr 4.4.0 on Optimize call

2013-10-30 Thread Erick Erickson
Robert:

Thanks. I'm on my way out the door, so I'll have to put up a JIRA with your
patch later if it hasn't been done already

Erick


On Tue, Oct 29, 2013 at 10:14 PM, Robert Muir rcm...@gmail.com wrote:

 I think its a bug, but thats just my opinion. i sent a patch to dev@
 for thoughts.

 On Tue, Oct 29, 2013 at 6:09 PM, Erick Erickson erickerick...@gmail.com
 wrote:
  Hmmm, so you're saying that merging indexes where a field
  has been removed isn't handled. So you have some documents
  that do have a what field, but your schema doesn't have it,
  is that true?
 
  It _seems_ like you could get by by putting the _what_ field back
  into your schema, just not sending any data to it in new docs.
 
  I'll let others who understand merging better than me chime in on
  whether this is a case that should be handled or a bug. I pinged the
  dev list to see what the opinion is
 
  Best,
  Erick
 
 
  On Mon, Oct 28, 2013 at 6:39 PM, Matthew Shapiro m...@mshapiro.net
 wrote:
 
  Sorry for reposting after I just sent in a reply, but I just looked at
 the
  error trace closer and noticed
 
 
 1. Caused by: java.lang.IllegalArgumentException: no such field what
 
 
  The 'what' field was removed by request of the customer as they wanted
 the
  logic behind what gets queried in the what field to be code side
 instead
  of solr side (for easier changing without having to re-index
 everything.  I
  didn't feel strongly either way and since they are paying me, I took it
  out).
 
  This makes me wonder if its crashing while merging because a field that
  used to be there is now gone.  However, this seems odd to me as Solr
  doesn't even let me delete the old data and instead its leaving my
  collection in an extremely bad state, with the only remedy I can think
 of
  is to nuke the index at the filesystem level.
 
  If this is indeed the cause of the crash, is the only way to delete a
 field
  to first completely empty your index first?
 
 
  On Mon, Oct 28, 2013 at 6:34 PM, Matthew Shapiro m...@mshapiro.net
 wrote:
 
   Thanks for your response.
  
   You were right, solr is logging to the catalina.out file for tomcat.
   When
   I click the optimize button in solr's admin interface the following
 logs
   are written: http://apaste.info/laup
  
   About JVM memory, solr's admin interface is listing JVM memory at 3.1%
   (221.7MB is dark grey, 512.56MB light grey and 6.99GB total).
  
  
   On Mon, Oct 28, 2013 at 6:29 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
  
   For Tomcat, the Solr is often put into catalina.out
   as a default, so the output might be there. You can
   configure Solr to send the logs most anywhere you
   please, but without some specific setup
   on your part the log output just goes to the default
   for the servlet.
  
   I took a quick glance at the code but since the merges
   are happening in the background, there's not much
   context for where that error is thrown.
  
   How much memory is there for the JVM? I'm grasping
   at straws a bit...
  
   Erick
  
  
   On Sun, Oct 27, 2013 at 9:54 PM, Matthew Shapiro m...@mshapiro.net
  wrote:
  
I am working at implementing solr to work as the search backend for
  our
   web
system.  So far things have been going well, but today I made some
   schema
changes and now things have broken.
   
I updated the schema.xml file and reloaded the core (via the admin
interface).  No errors were reported in the logs.
   
I then pushed 100 records to be indexed.  A call to Commit
 afterwards
seemed fine, however my next call for Optimize caused the following
   errors:
   
java.io.IOException: background merge hit exception:
_2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37
[maxNumSegments=1]
   
null:java.io.IOException: background merge hit exception:
_2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37
[maxNumSegments=1]
   
   
Unfortunately, googling for background merge hit exception came up
with 2 thing: a corrupt index or not enough free space.  The host
machine that's hosting solr has 227 out of 229GB free (according
 to df
-h), so that's not it.
   
   
I then ran CheckIndex on the index, and got the following results:
http://apaste.info/gmGU
   
   
As someone who is new to solr and lucene, as far as I can tell this
means my index is fine. So I am coming up at a loss. I'm somewhat
 sure
that I could probably delete my data directory and rebuild it but
 I am
more interested in finding out why is it having issues, what is the
best way to fix it, and what is the best way to prevent it from
happening when this goes into production.
   
   
Does anyone have any advice that may help?
   
   
As an aside, i do not have a stacktrace for you because the solr
 admin
page isn't giving me one.  I tried looking in my logs file in my
 solr
directory, but it does not contain any logs.  I 

Store Solr OpenBitSets In Solr Indexes

2013-10-30 Thread David Philip
Hi All,

What should be the field type if I have to save Solr's OpenBitSet value
within a Solr document object and retrieve it later for search?

  OpenBitSet bits = new OpenBitSet();

  bits.set(0);
  bits.set(1000);

  doc.addField("SolrBitSets", bits);


What should be the field type of SolrBitSets?

Thanks
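
One possible approach, sketched below under stated assumptions (this is not an
established Solr recipe): since an OpenBitSet is just a long[] plus a word count,
it can be serialized to a byte[] and stored in a field whose type is
solr.BinaryField, then rebuilt after retrieval. Note that a stored binary field
can be retrieved but not searched on directly; searching on individual bits would
need a different representation, e.g. indexing the set bit positions in a
multivalued numeric field. The field and document names below are made up.

import java.nio.ByteBuffer;
import org.apache.lucene.util.OpenBitSet;
import org.apache.solr.common.SolrInputDocument;

public class BitSetFieldExample {

    // Serialize the bitset's backing words so the value can go into a binary field.
    static byte[] toBytes(OpenBitSet bits) {
        long[] words = bits.getBits();
        int numWords = bits.getNumWords();
        ByteBuffer buf = ByteBuffer.allocate(4 + numWords * 8);
        buf.putInt(numWords);
        for (int i = 0; i < numWords; i++) {
            buf.putLong(words[i]);
        }
        return buf.array();
    }

    // Rebuild the OpenBitSet from the stored bytes after fetching the document.
    static OpenBitSet fromBytes(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int numWords = buf.getInt();
        long[] words = new long[numWords];
        for (int i = 0; i < numWords; i++) {
            words[i] = buf.getLong();
        }
        return new OpenBitSet(words, numWords);
    }

    public static void main(String[] args) {
        OpenBitSet bits = new OpenBitSet();
        bits.set(0);
        bits.set(1000);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        // Assumes "SolrBitSets" is declared with type "binary" (solr.BinaryField) in schema.xml.
        doc.addField("SolrBitSets", toBytes(bits));
    }
}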


Re: Data import handler with multi tables

2013-10-30 Thread Stefan Matheis
That is what I'd call a compound key? :) Using multiple attributes to generate a 
unique key across multiple tables ..
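
For reference, a minimal data-import-handler sketch of that idea, with concat()
in the entity query as in the message quoted below; the table and column names
here are invented:

<entity name="table1"
        query="SELECT CONCAT(id, '_', 'table1') AS id, title FROM table1">
  <field column="id" name="id"/>
  <field column="title" name="title"/>
</entity>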


On Wednesday, October 30, 2013 at 2:10 AM, dtphat wrote:

 yes, I've just used concat(id, '_', tableName) instead using compound key. I
 think this is an easy way.
 Thanks.
 
 
 
 -
 Phat T. Dong
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Re-Data-import-handler-with-multi-tables-tp4098048p4098328.html
 Sent from the Solr - User mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 




Re: Atomic Updates in SOLR

2013-10-30 Thread Jack Krupansky
Unfortunately, atomic "add" is add to a list (append) rather than add to a 
set (only unique values). But, you can use the unique fields update 
processor (solr.UniqFieldsUpdateProcessorFactory) to de-dupe specified 
multivalued fields.


See:
http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/UniqFieldsUpdateProcessorFactory.html
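
For illustration, such a chain might be declared in solrconfig.xml roughly as
below. This is only a sketch: the chain and field names are invented, and the
exact parameter syntax should be checked against the javadoc above, since it
differs between 4.4 and 4.5. The chain also has to be referenced from the update
handler (e.g. via an update.chain parameter or by marking it default) to take
effect.

<updateRequestProcessorChain name="dedupe-values">
  <processor class="solr.UniqFieldsUpdateProcessorFactory">
    <str name="fieldName">tag</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>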

My e-book has more examples as well.

-- Jack Krupansky

-Original Message- 
From: Anupam Bhattacharya

Sent: Wednesday, October 30, 2013 6:05 AM
To: solr-user@lucene.apache.org
Subject: Atomic Updates in SOLR

I am working on a offline tagging capability to tag records with a
thesaurus dictionary of key concepts. I am able to use the update=add
option using xml and json update calls for a field to update specific
document field information. Although if I run the same atomic update query
twice then the multivalued string fields start showing duplicate value in
the multivalued field.
e.g. for a field name as tag at the initial it was having copper, iron,
steel
After running the atomic update query with <field name="tag"
update="add">steel</field> I will get the tag field values as following:
copper, iron, steel, steel. (Thus steel get added twice).
I looked at RemoveDuplicatesTokenFilterFactory but it helps to remove token
duplicate not multivalued field duplicates. Is there any updateProcessor to
stop the incoming duplicate value from indexing ?

Thanks in advance for any help.

Regards
Anupam 



Re: Atomic Updates in SOLR

2013-10-30 Thread Jack Krupansky
Oops... need to note that the parameters have changed since Solr 4.4 - I 
gave the link for 4.5.1, but for 4.4 and earlier, use:


http://lucene.eu.apache.org/solr/4_4_0/solr-core/org/apache/solr/update/processor/UniqFieldsUpdateProcessorFactory.html

(My book is for 4.4, but hasn't been updated for 4.5 yet, but the gist of 
the examples is the same.)


-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Wednesday, October 30, 2013 9:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Atomic Updates in SOLR

Unfortunately, atomic add is add to a list (append) rather than add to a
set (only unique values). But, you can use the unique fields update
processor (solr.UniqFieldsUpdateProcessorFactory) to de-dupe specified
multivalued fields.

See:
http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/UniqFieldsUpdateProcessorFactory.html

My e-book has more examples as well.

-- Jack Krupansky

-Original Message- 
From: Anupam Bhattacharya

Sent: Wednesday, October 30, 2013 6:05 AM
To: solr-user@lucene.apache.org
Subject: Atomic Updates in SOLR

I am working on a offline tagging capability to tag records with a
thesaurus dictionary of key concepts. I am able to use the update=add
option using xml and json update calls for a field to update specific
document field information. Although if I run the same atomic update query
twice then the multivalued string fields start showing duplicate value in
the multivalued field.
e.g. for a field name as tag at the initial it was having copper, iron,
steel
After running the atomic update query with <field name="tag"
update="add">steel</field> I will get the tag field values as following:
copper, iron, steel, steel. (Thus steel get added twice).
I looked at RemoveDuplicatesTokenFilterFactory but it helps to remove token
duplicate not multivalued field duplicates. Is there any updateProcessor to
stop the incoming duplicate value from indexing ?

Thanks in advance for any help.

Regards
Anupam 



Unable to add mahout classifier

2013-10-30 Thread lovely kasi
Hi,

I made a few changes to solrconfig.xml, created a jar file, added it to
the lib folder of Solr and tried to start it.

The changes in solrconfig.xml are:

<updateRequestProcessorChain name="mahoutclassifier" default="true">
  <processor class="com.mahout.solr.classifier.CategorizeDocumentFactory">
    <str name="inputField">LEAD_NOTES</str>
    <str name="outputField">category</str>
    <str name="defaultCategory">Others</str>
    <str name="model">naiveBayesModel</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update/csv" class="solr.CSVRequestHandler">
  <lst name="defaults">
    <str name="stream.contentType">application/csv</str>
    <str name="update.processor">mahoutclassifier</str>
  </lst>
</requestHandler>

I attached the class file.

But I get the following error:

org.apache.solr.common.SolrException: Error Instantiating
UpdateRequestProcessorFactory,
com.mahout.solr.classifier.CategorizeDocumentFactory failed to instantiate
org.apache.solr.update.processor.UpdateRequestProcessorFactory
at org.apache.solr.core.SolrCore.init(SolrCore.java:834)
at org.apache.solr.core.SolrCore.init(SolrCore.java:625)
at
org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:522)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:557)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:247)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:239)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.common.SolrException: Error Instantiating
UpdateRequestProcessorFactory,
com.mahout.solr.classifier.CategorizeDocumentFactory failed to instantiate
org.apache.solr.update.processor.UpdateRequestProcessorFactory
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:547)
at
org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:582)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2144)
at
org.apache.solr.update.processor.UpdateRequestProcessorChain.init(UpdateRequestProcessorChain.java:119)
at
org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:584)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2128)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2122)
at
org.apache.solr.core.SolrCore.loadUpdateProcessorChains(SolrCore.java:906)
at org.apache.solr.core.SolrCore.init(SolrCore.java:766)
... 13 more
Caused by: java.lang.ClassCastException: class
com.mahout.solr.classifier.CategorizeDocumentFactory
at java.lang.Class.asSubclass(Unknown Source)
at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:433)
at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:381)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:526)
... 21 more


Thanks,
Subbu
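
For context, the ClassCastException at the bottom of that trace is what Solr's
resource loader throws when the configured class cannot be cast to
UpdateRequestProcessorFactory, for example because it extends a different class
or was compiled against a copy of the Solr jars other than the ones Solr itself
loaded. A factory plugged into an updateRequestProcessorChain is expected to
follow roughly the shape below; this is a bare-bones, hypothetical sketch, not
the actual CategorizeDocumentFactory.

import java.io.IOException;

import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class CategorizeDocumentFactory extends UpdateRequestProcessorFactory {

    private String inputField;

    @Override
    public void init(NamedList args) {
        // e.g. "LEAD_NOTES" from the chain configuration above
        inputField = (String) args.get("inputField");
    }

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                // classification of cmd.getSolrInputDocument() would happen here
                super.processAdd(cmd);
            }
        };
    }
}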


Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread eShard
Wow again! 
Thank you all very much for your insights.  
We will certainly take all of this under consideration.

Erik: I want to upgrade but unfortunately, it's not up to me. You're right,
we definitely need to do it.  
And SolrJ sounds interesting, thanks for the suggestions.

By the way, is there a Solr upgrade guide out there anywhere?


Thanks again!





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Configuration-and-specs-to-index-a-1-terabyte-TB-repository-tp4098227p4098431.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Language detection for multivalued field

2013-10-30 Thread Jan Høydahl
Hi,

First, the feature will only detect ONE language per field, even if it is a 
multi-valued field. In your case there is VERY little text for the detector, so 
do not expect great detection quality. But I believe the detector chose ES as 
language and mapped the whole field as tag_es. The reason you do not see tag_es 
in the first schema version is naturally because you have it defined as 
stored=false.

If you want individual detection of each value, please send the values in 
differently named fields, or file a JIRA to add a feature request for 
individual detection of language for values in a multiValued field.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

22. okt. 2013 kl. 14:16 skrev vatuska vatu...@yandex.ru:

 *Can you elaborate on your comment "There isn't tag indexed". Are you saying
 that your multiValued tag field is not indexed at all, gone, missing?*
 There isn't any tag_... field despite indexed="true" stored="true" for the
 dynamicField.
 
 I found the reason, but I don't understand why.
 If I specify
 <str name="langid.whitelist">en,es</str>
 
 there isn't any tag_... field for the document
 ...
 <field name="tag">español</field>
 <field name="tag">first</field>
 <field name="tag">My tag</field>
 ...
 
 when these lines are in schema.xml:
 <dynamicField name="*_undfnd" type="text_general" indexed="true" stored="true" multiValued="true"/>
 <dynamicField name="*_en" type="text_en_splitting" indexed="true" stored="true" multiValued="true"/>
 <dynamicField name="*_es" type="text_es" indexed="true" stored="false" multiValued="true"/>
 
 But if I specify
 <dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true"/>
 
 there is a *tag_es* : español, first, My tag
 in the stored document.
 
 Could you explain, please, how does it work? 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Language-detection-for-multivalued-field-tp4096996p4097013.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread Shawn Heisey
On 10/30/2013 4:00 AM, Toke Eskildsen wrote:
 On Tue, 2013-10-29 at 16:41 +0100, Shawn Heisey wrote:
 If you put the index on SSD, you could get by with less RAM, but a RAID
 solution that works properly with SSD (TRIM support) is hard to find, so
 SSD failure in most situations effectively means a server failure.  Solr
 and Lucene have a track record of shredding SSD into failure, because
 typically there is a LOT of writing involved.
 
 Why would TRIM have any influence on whether or not a driver failure
 also means server failure?

I left out a step in my description.

Lack of TRIM support in RAID means that I would avoid RAID with SSD.  No
RAID means that when the SSD fails, that Solr is out of commission until
its SSD can be replaced.  If you've got multiple replicas and good error
alarming, then that won't pose a major issue.

I don't know how Solr would behave if you put each core on its own SSD
and one of them fails.  Hopefully it's smart enough to keep going with
the cores that have working filesystems.

Thanks,
Shawn



Re: query with colon in bq

2013-10-30 Thread Jack Krupansky
Escape any special characters with a backslash, or put the full term in 
quotes.
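
For illustration (adapting the query from the message below), either form should
parse: bq=Title:"1:100"^6.0 Body:"1:100"^6.0, or with the colon escaped,
bq=Title:1\:100^6.0 Body:1\:100^6.0. The colon in the main q parameter needs the
same treatment.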


-- Jack Krupansky

-Original Message- 
From: jihyun suh

Sent: Wednesday, October 30, 2013 6:28 AM
To: solr-user@lucene.apache.org
Subject: query with colon in bq

I have a question about query with colon in bq.
Actually I use edismax and I set the q and bq just like this,

.../select?defType=edismax&q=1:100^100 1 100^30&qf=Title^2.0
Body&bq=Title:(1:100)^6.0 Body:(1:100)^6.0

in this query phrase, I got the error in bq, undefined field 1.
How do I use query with colon in bq?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/query-with-colon-in-bq-tp4098400.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Indexing logs files of thousands of GBs

2013-10-30 Thread keshari.prerna
Hello,

As suggested by Chris, I am now accessing the files using a Java program and
creating a SolrInputDocument, but I ran into this exception while doing
server.add(document). When I tried to increase ramBufferSizeMB, it doesn't
let me make it more than 2 gig.

org.apache.solr.client.solrj.SolrServerException: Server at
http://localhost:8983/solr/logsIndexing returned non ok status:500,
message:the request was rejected because its size (2097454) exceeds the
configured maximum (2097152) 
org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the
request was rejected because its size (2097454) exceeds the configured
maximum (2097152)   at
org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl$1.raiseError(FileUploadBase.java:902)
  
at
org.apache.commons.fileupload.util.LimitedInputStream.checkLimit(LimitedInputStream.java:71)
  
at
org.apache.commons.fileupload.util.LimitedInputStream.read(LimitedInputStream.java:128)
  
at
org.apache.commons.fileupload.MultipartStream$ItemInputStream.makeAvailable(MultipartStream.java:977)
  
at
org.apache.commons.fileupload.MultipartStream$ItemInputStream.read(MultipartStream.java:887)
  
at java.io.InputStream.read(Unknown Source) at
org.apache.commons.fileupload.util.Streams.copy(Streams.java:94)at
org.apache.commons.fileupload.util.Streams.copy(Streams.java:64)at
org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362)
  
at
org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
  
at
org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
  
at
org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
  
at
org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)  
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
  
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)  
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)  
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)  
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)  
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)  
at org.mortbay.jetty.handler.ContextHand
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:121)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106)
at Filewalker.walk(LogsIndexer.java:48)
at Filewalker.main(LogsIndexer.java:69)

How do I get rid of this?

Thanks,
Prerna



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073p4098438.html
Sent from the Solr - User mailing list archive at Nabble.com.


Computing Results So That They are Returned in Search Results

2013-10-30 Thread Alejandro Calbazana
I'd like to throw out a design question and see if it's possible to solve
this with Solr.

I have a set of data that is computed that I'd like to make searchable.
Ideally, I'd like to have all documents indexed and call it a day, but
the nature of the data is such that it needs to be computed given a
definition.  I'm interested in searching on definitions and then creating
results on the fly that are calculated based on something embedded in the
definition.

Is it possible to embed this calculation logic into Solr's result handling
process?  I know this sounds exotic, but the nature of the data is such
that I can't index these calculated documents because I don't know what the
boundary is and specifying an arbitrary number isn't ideal.

Has anyone run across something like this?

Thanks,

Alejandr


Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread Toke Eskildsen
On Wed, 2013-10-30 at 14:24 +0100, Shawn Heisey wrote:
 On 10/30/2013 4:00 AM, Toke Eskildsen wrote:
  Why would TRIM have any influence on whether or not a driver failure
  also means server failure?
 
 I left out a step in my description.
 
 Lack of TRIM support in RAID means that I would avoid RAID with SSD.  No
 RAID means that when the SSD fails, that Solr is out of commission until
 its SSD can be replaced.

That makes sense, thanks.

 I don't know how Solr would behave if you put each core on its own SSD
 and one of them fails.  Hopefully it's smart enough to keep going with
 the cores that have working filesystems.

I don't know either. Seems like it would be a useful thing to test.

We did some comparison on 9 shards of 420GB (against a SAN), where we
tested SolrCloud with 9 independent Solr instances vs. a single instance
with multiple cores. The overhead of independent instances did not seem
severe for that shard size and should be resilient against single drive
failure.

As we're looking at a cumulative heap requirement of 100GB+ due to
grouping and faceting, it might be preferable to run with independent
Solrs anyway to minimize garbage collection pauses. I do not know if
that logic extends in general to large Solr installations.

Regards,
Toke Eskildsen, State and University Library, Denmark



Evaluating a SOLR index with trec_eval

2013-10-30 Thread Michael Preminger
Hello!

Is there a simple way to evaluate a SOLR index with TREC_EVAL?
I mean: 
*  preparing a query file in some format Solr will understand, but where each 
query has an ID
* getting results out in trec format, with these query IDs attached

Thanks

Michael


[SolrCloud-Solrj] Document router problem connecting to Zookeeper ensemble

2013-10-30 Thread Alessandro Benedetti
I have a ZooKeeper ensemble hosted on one Amazon server.
Using CloudSolrServer and trying to connect, I obtain this really
unusual error:

969 [main] INFO org.apache.solr.common.cloud.ConnectionManager - Client is
connected to ZooKeeper
1043 [main] INFO org.apache.solr.common.cloud.ZkStateReader - Updating
cluster state from ZooKeeper...
Exception in thread main org.apache.solr.common.SolrException: Unknown
document router '{name=implicit}'
at org.apache.solr.common.cloud.DocRouter.getDocRouter(DocRouter.java:46)

Although in my collection I have the compositeId strategy for routing (
from the clusterState.json ) .

This is how I instantiate the server :

CloudSolrServer server;
server = new CloudSolrServer(
    "ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2181," +
    "ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2182," +
    "ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2183");
server.setDefaultCollection("example");
SolrPingResponse ping = server.ping();

Any hint ?
-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


SolrCloud batch updates

2013-10-30 Thread michael.boom
I'm currently using a SolrCloud setup and I index my data using a couple of
in-house indexing clients.
The clients process some files and post json messages containing added
documents in batches.
Initially my batch size was 100k docs and the post request took about 20-30
secs.
I switched to 10k batches and now the updates are much faster but also more
in number.

My commit settings are :
- autocommit - 45s / 100k docs, openSearcher=false 
- softAutoCommit - every 3 minutes

I'm trying to figure out which one is preferable - bigger, rarer batches or
smaller, more frequent ones? And why?
What background operations take place after posting docs?
At which point does replication kick in - after a commit or after an update?






-
Thanks,
Michael
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-batch-updates-tp4098463.html
Sent from the Solr - User mailing list archive at Nabble.com.


Problem querying with edismax and hyphens

2013-10-30 Thread Vardhan Dharnidharka
Hi,

The query z-score doesn't match a doc with zscore in the index. The analysis 
tool shows that this query would match this data in the index, but it's the 
edismax query parser step that seems to screw things up. Is there some 
combination of autoGeneratePhraseQueries, WordDelimiterFilterFactory 
parameters, and/or something else I can change or add to generically make the 
query match without modifying the mm? ie. without adding a rule to specifically 
synonymize or split the term zscore with some dictionary of words.

The query I want to match but doesn't:
z-score
mm=-30%

In the index:
zscore

The analyzer: 

<fieldType autoGeneratePhraseQueries="false" class="solr.TextField" 
name="lowStopText" positionIncrementGap="100">

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter catenateAll="1" catenateNumbers="1" catenateWords="1" 
class="solr.WordDelimiterFilterFactory" preserveOriginal="1" 
splitOnCaseChange="0" splitOnNumerics="0" types="wdfftypes.txt"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter catenateAll="1" catenateNumbers="1" catenateWords="1" 
class="solr.WordDelimiterFilterFactory" preserveOriginal="1" 
splitOnCaseChange="0" splitOnNumerics="0" types="wdfftypes.txt"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true" 
ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

The parsed edismax query with autoGeneratePhraseQueries=true:
+(def_term:\(z-score z) (score zscore)\)

The parsed edismax query with autoGeneratePhraseQueries=false:
+(((def_term:z-score def_term:z def_term:score def_term:zscore)~3))

Thanks
Vardhan   

Re: Computing Results So That They are Returned in Search Results

2013-10-30 Thread Jack Krupansky
You could create a custom value source and then use it in a function query 
embedded in your return fields list (fl).


So, the function query could use a function (value source) that takes a 
field, fetches its value, performs some arbitrary calculation, and then 
returns that value.


fl=id,name,my-func(field1),my-func(field2)

-- Jack Krupansky

-Original Message- 
From: Alejandro Calbazana

Sent: Wednesday, October 30, 2013 10:10 AM
To: solr-user@lucene.apache.org
Subject: Computing Results So That They are Returned in Search Results

I'd like to throw out a design question and see if its possible to solve
this with Solr.

I have a set of data that is computed that I'd like to make searchable.
Ideally, I'd like to have all documents indexed and call it the day, but
the nature of the data is such that it needs to be computed given a
definition.  I'm interested in searching on definitions and then creating
results on the fly that are calculated based on something embedded in the
definition.

Is it possible to embed this calculation login into Solr's result handling
process?  I know this sounds exotic, but the nature of the data is such
that I can't index these calculated documents because I don't know what the
boundary is and specifiying an arbitrary number isn't ideal.

Has anyone run across something like this?

Thanks,

Alejandr 



Re: solr 4.5.0 configuration Error: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config file .../solrconfig.xml

2013-10-30 Thread Shawn Heisey

On 10/30/2013 9:24 AM, Elena Camossi wrote:

Hi everyone,

I'm trying to configure Solr 4.5.0 on Linux red Hat to work with CKAN and
Tomcat, but Solr cannot initialize the core (I'm configuring just one core,
but this is likely to change in the next future. I'm using contexts for this
set up). Tomcat is working correctly, and list solr among running
applications.  When I open the Solr dashboard, the Solr instance is running
but I see this error

SolrCore Initialization Failures
ckan-schema-2.0:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Could not load config file /usr/share/solr/ckan/conf/solrconfig.xml


snip


The content of my solr.xml  for core settings (/usr/share/solr/solr.xml, in
my installation) is

<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores" defaultCoreName="ckan">
    <core name="ckan-schema-2.0" instanceDir="ckan/conf">
      <property name="dataDir" value="/var/lib/solr/data/ckan"/>
    </core>
  </cores>
</solr>


Typically, instanceDir will not have the conf on it - it should just 
be ckan for this.  Solr automatically adds the conf when it is looking 
for the configuration.


Later you show that you have dataDir defined in solrconfig.xml -- take 
that out entirely.  The dataDir is specified in solr.xml, putting it in 
solrconfig.xml also is just asking for problems -- especially if you 
ever end up sharing the solrconfig.xml between more than one core, which 
is what happens with SolrCloud.  Also, evidence seems to suggest that 
the ${dataDir} substitution that used to work in older versions was a 
fluke.  After a recent rigorous properties cleanup, it is no longer 
supported, unless you actually define that as a java system property.


Finally, make sure that the permissions of all paths leading to both the 
symlink for your conf directory and the actual conf directory are 
readable to the tomcat user, not just root.





Re: Indexing logs files of thousands of GBs

2013-10-30 Thread keshari.prerna
I have set the multipartUploadLimitInKB parameter to 10240 (it was 2048
earlier):

multipartUploadLimitInKB=10240. Now it gives the following error for the same
files as before.

http://localhost:8983/solr/logsIndexing returned non ok status:500,
message:the request was rejected because its size (10486046) exceeds the
configured maximum (10485760).



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073p4098472.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: solr 4.5.0 configuration Error: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config file .../solrconfig.xml

2013-10-30 Thread Elena Camossi
Dear Shawn,

thanks a lot for your quick answer. 

 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: mercoledì 30 ottobre 2013 17:12
 To: solr-user@lucene.apache.org
 Subject: Re: solr 4.5.0 configuration Error:
 org.apache.solr.common.SolrException:org.apache.solr.common.SolrExcepti
 on: Could not load config file .../solrconfig.xml
 
 On 10/30/2013 9:24 AM, Elena Camossi wrote:
  Hi everyone,
 
  I'm trying to configure Solr 4.5.0 on Linux red Hat to work with CKAN
  and Tomcat, but Solr cannot initialize the core (I'm configuring just
  one core, but this is likely to change in the next future. I'm using
  contexts for this set up). Tomcat is working correctly, and list solr
  among running applications.  When I open the Solr dashboard, the Solr
  instance is running but I see this error
 
  SolrCore Initialization Failures
  ckan-schema-2.0:
 
 org.apache.solr.common.SolrException:org.apache.solr.common.SolrExcepti
 on:
  Could not load config file /usr/share/solr/ckan/conf/solrconfig.xml
 
 snip
 
  The content of my solr.xml  for core settings
  (/usr/share/solr/solr.xml, in my installation) is
 
  <solr persistent="true" sharedLib="lib">
    <cores adminPath="/admin/cores" defaultCoreName="ckan">
      <core name="ckan-schema-2.0" instanceDir="ckan/conf">
        <property name="dataDir" value="/var/lib/solr/data/ckan"/>
      </core>
    </cores>
  </solr>
 
 Typically, instanceDir will not have the conf on it - it should just be
ckan
 for this.  Solr automatically adds the conf when it is looking for the
 configuration.

 
 Later you show that you have dataDir defined in solrconfig.xml -- take
that
 out entirely.  The dataDir is specified in solr.xml, putting it in
solrconfig.xml
 also is just asking for problems -- especially if you ever end up sharing
the
 solrconfig.xml between more than one core, which is what happens with
 SolrCloud.  Also, evidence seems to suggest that the ${dataDir}
substitution
 that used to work in older versions was a fluke.  After a recent rigorous
 properties cleanup, it is no longer supported, unless you actually define
that
 as a java system property.


Actually, I had tried instanceDir="ckan", but it didn't work either (with
the same error, just reporting a wrong path to solrconfig.xml).
I used this configuration taking suggestion from here
http://stackoverflow.com/questions/16230493/apache-solr-unable-to-access-adm
in-page).
But now that I have commented out the dataDir setting in solrconfig.xml as you
suggest, the behaviour changes and I have a different error in the Solr
logging:



SolrCore Initialization Failures

ckan-schema-2.0:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Error loading class 'solr.clustering.ClusteringComponent' 

Please check your logs for more information
Log4j (org.slf4j.impl.Log4jLoggerFactory)
Time      Level  Logger              Message
17:36:43  WARN   SolrResourceLoader  Can't find (or read) directory to add to classloader: ../../../contrib/extraction/lib (resolved as: /usr/share/solr/ckan/../../../contrib/extraction/lib).
17:36:43  WARN   SolrResourceLoader  Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /usr/share/solr/ckan/../../../dist).
17:36:43  WARN   SolrResourceLoader  Can't find (or read) directory to add to classloader: ../../../contrib/clustering/lib/ (resolved as: /usr/share/solr/ckan/../../../contrib/clustering/lib).
17:36:43  WARN   SolrResourceLoader  Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /usr/share/solr/ckan/../../../dist).
17:36:43  WARN   SolrResourceLoader  Can't find (or read) directory to add to classloader: ../../../contrib/langid/lib/ (resolved as: /usr/share/solr/ckan/../../../contrib/langid/lib).
17:36:43  WARN   SolrResourceLoader  Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /usr/share/solr/ckan/../../../dist).
17:36:43  WARN   SolrResourceLoader  Can't find (or read) directory to add to classloader: ../../../contrib/velocity/lib (resolved as: /usr/share/solr/ckan/../../../contrib/velocity/lib).
17:36:43  WARN   SolrResourceLoader  Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /usr/share/solr/ckan/../../../dist).
17:36:44  WARN   SolrCore            [ckan-schema-2.0] Solr index directory '/var/lib/solr/data/ckan/index' doesn't exist. Creating new index...
17:36:45  ERROR  CoreContainer       Unable to create core: ckan-schema-2.0
17:36:45  ERROR  CoreContainer
null:org.apache.solr.common.SolrException: Unable to create core:
ckan-schema-2.0
null:org.apache.solr.common.SolrException: Unable to create core:
ckan-schema-2.0
at
org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:936)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:568)
at 

Re: Computing Results So That They are Returned in Search Results

2013-10-30 Thread Alejandro Calbazana
Sounds really close to what I'm looking for, but this sounds like it would
result in a new field on a document (or a new value for a field defined to
hold the result of a function).  Would it be possible for a function query
to produce a new document so that I can associate the computed value with
it?

Thanks,

Alejandro


On Wed, Oct 30, 2013 at 12:05 PM, Jack Krupansky j...@basetechnology.comwrote:

 You could create a custom value source and then use it in a function
 query embedded in your return fields list (fl).

 So, the function query could use a function (value source) that takes a
 field, fetches its value, performs some arbitrary calculation, and then
 returns that value.

 fl=id,name,my-func(field1),my-func(field2)

 -- Jack Krupansky

 -Original Message- From: Alejandro Calbazana
 Sent: Wednesday, October 30, 2013 10:10 AM
 To: solr-user@lucene.apache.org
 Subject: Computing Results So That They are Returned in Search Results

 I'd like to throw out a design question and see if its possible to solve
 this with Solr.

 I have a set of data that is computed that I'd like to make searchable.
 Ideally, I'd like to have all documents indexed and call it the day, but
 the nature of the data is such that it needs to be computed given a
 definition.  I'm interested in searching on definitions and then creating
 results on the fly that are calculated based on something embedded in the
 definition.

 Is it possible to embed this calculation login into Solr's result handling
 process?  I know this sounds exotic, but the nature of the data is such
 that I can't index these calculated documents because I don't know what the
 boundary is and specifiying an arbitrary number isn't ideal.

 Has anyone run across something like this?

 Thanks,

 Alejandr



Re: Indexing logs files of thousands of GBs

2013-10-30 Thread Otis Gospodnetic
Hi,

Hm, sorry for not helping with this particular issue directly, but it
looks like you are *uploading* your logs and indexing that way?
Wouldn't pushing them be a better fit when it comes to log indexing?
We recently contributed a Logstash output that can index logs to Solr,
which may be of interest - have a look at
https://twitter.com/otisg/status/395563043045638144 -- includes a
little diagram that shows how this fits into the picture.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



On Wed, Oct 30, 2013 at 9:55 AM, keshari.prerna
keshari.pre...@gmail.com wrote:
 Hello,

 As suggested by Chris, I am now accessing the files from a Java program and
 creating SolrInputDocuments, but I ran into this exception while doing
 server.add(document). When I tried to increase ramBufferSizeMB, it doesn't
 let me make it more than 2 GB.

 org.apache.solr.client.solrj.SolrServerException: Server at
 http://localhost:8983/solr/logsIndexing returned non ok status:500,
 message:the request was rejected because its size (2097454) exceeds the
 configured maximum (2097152)
 org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the
 request was rejected because its size (2097454) exceeds the configured
 maximum (2097152)   at
 org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl$1.raiseError(FileUploadBase.java:902)
 at
 org.apache.commons.fileupload.util.LimitedInputStream.checkLimit(LimitedInputStream.java:71)
 at
 org.apache.commons.fileupload.util.LimitedInputStream.read(LimitedInputStream.java:128)
 at
 org.apache.commons.fileupload.MultipartStream$ItemInputStream.makeAvailable(MultipartStream.java:977)
 at
 org.apache.commons.fileupload.MultipartStream$ItemInputStream.read(MultipartStream.java:887)
 at java.io.InputStream.read(Unknown Source) at
 org.apache.commons.fileupload.util.Streams.copy(Streams.java:94)at
 org.apache.commons.fileupload.util.Streams.copy(Streams.java:64)at
 org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362)
 at
 org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
 at
 org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
 at
 org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
 at
 org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
 at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
 at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
 at org.mortbay.jetty.handler.ContextHand
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
 at
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:121)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106)
 at Filewalker.walk(LogsIndexer.java:48)
 at Filewalker.main(LogsIndexer.java:69)

 How do I get rid of this?

 Thanks,
 Prerna



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073p4098438.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr 4.5.0 configuration Error: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config file .../solrconfig.xml

2013-10-30 Thread Shawn Heisey

On 10/30/2013 10:44 AM, Elena Camossi wrote:

Actually, I had tried the instanceDir=ckan but it didn't work either (with
the same error, just reporting a wrong path to solrconf.xml).
I used this configuration taking suggestion from here
http://stackoverflow.com/questions/16230493/apache-solr-unable-to-access-adm
in-page).
But now that I have commented the dataDir setting in solrconf.xml as you
suggest, it changes behaviour and i have a different error from Solr
Logging:



SolrCore Initialization Failures

 ckan-schema-2.0:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Error loading class 'solr.clustering.ClusteringComponent'

Please check your logs for more information
Log4j (org.slf4j.impl.Log4jLoggerFactory)
Time      Level  Logger              Message
17:36:43  WARN   SolrResourceLoader  Can't find (or read) directory to add to classloader: ../../../contrib/extraction/lib (resolved as: /usr/share/solr/ckan/../../../contrib/extraction/lib).


Your solrconfig.xml file includes the ClusteringComponent, but you don't 
have the jars required for that component available.  Your solrconfig 
file does have a bunch of lib directives, but they don't point 
anywhere that's valid -- they assume that the entire Solr download is 
available, not just what's in the example dir.  The jar for that 
particular component can be found in the download as 
dist/solr-clustering-X.X.X.jar ... but it is likely to also require 
additional jars, such as those found in contrib/clustering/lib.


When it comes to extra jars for contrib or third-party components, the 
best thing to do is remove all lib directives from solrconfig.xml and 
put the jars in ${solr.solr.home}/lib.  For you that location would be 
/usr/share/solr/lib.  Solr automatically looks in this location without 
any extra configuration.
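
Concretely, a sketch of that change (paths taken from this setup; the exact jar
names depend on the Solr release): in solrconfig.xml, drop the relative lib
directives such as

  <lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />
  <lib dir="../../../dist/" regex="solr-clustering-\d.*\.jar" />

and instead copy dist/solr-clustering-*.jar plus the jars from
contrib/clustering/lib/ in the Solr download into /usr/share/solr/lib/.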


Further advice - remove things you don't need from your config.  If 
you're not planning to use the clustering component, take it out.  Also 
remove any handlers that refer to components you won't be using -- the 
/browse handler is a prime example of something that most people don't need.


Thanks,
Shawn



Re: Computing Results So That They are Returned in Search Results

2013-10-30 Thread Jack Krupansky
A function query is simply returning a calculated result based on existing 
data - no new fields required.


Did you actually want to precompute a value, store it in the index, and then 
query on it? If so, you could do that indexing with a custom or scripted 
update processor.
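
If the precompute-at-index-time route appeals, here is a sketch of a scripted
update processor chain (chain, field and script names are only placeholders):

<updateRequestProcessorChain name="precompute">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">compute-fields.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The referenced JavaScript file lives in the core's conf/ directory and implements
processAdd(cmd), where you can read your definition field from cmd.solrDoc and set
the computed field before the document is indexed.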


Flesh out an example of exactly what you want.

-- Jack Krupansky

-Original Message- 
From: Alejandro Calbazana

Sent: Wednesday, October 30, 2013 12:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Computing Results So That They are Returned in Search Results

Sounds really close to what I'm looking for, but this sounds like it would
result in a new field on a document (or a new value for a field defined to
hold the result of a function).  Would it be possible for a function query
to produce a new document so that I can associate the computed value with
it?

Thanks,

Alejandro


On Wed, Oct 30, 2013 at 12:05 PM, Jack Krupansky 
j...@basetechnology.comwrote:



You could create a custom value source and then use it in a function
query embedded in your return fields list (fl).

So, the function query could use a function (value source) that takes a
field, fetches its value, performs some arbitrary calculation, and then
returns that value.

 fl=id,name,my-func(field1),my-func(field2)

-- Jack Krupansky

-Original Message- From: Alejandro Calbazana
Sent: Wednesday, October 30, 2013 10:10 AM
To: solr-user@lucene.apache.org
Subject: Computing Results So That They are Returned in Search Results

I'd like to throw out a design question and see if its possible to solve
this with Solr.

I have a set of data that is computed that I'd like to make searchable.
Ideally, I'd like to have all documents indexed and call it the day, but
the nature of the data is such that it needs to be computed given a
definition.  I'm interested in searching on definitions and then creating
results on the fly that are calculated based on something embedded in the
definition.

Is it possible to embed this calculation login into Solr's result handling
process?  I know this sounds exotic, but the nature of the data is such
that I can't index these calculated documents because I don't know what 
the

boundary is and specifiying an arbitrary number isn't ideal.

Has anyone run across something like this?

Thanks,

Alejandr





Replacing Google Mini Search Appliance with Solr?

2013-10-30 Thread Palmer, Eric
Hello all,

Been lurking on the list for awhile.

Our two Google Mini search appliances, used to index our public web sites, are at 
end of life. Google is no longer selling the Mini appliances, and buying the big 
appliance is not cost effective.

http://search.richmond.edu/

We would run a solr replacement on Linux (CentOS, Red Hat, or similar) with 
OpenJDK or Oracle Java.

Background
==
~130 sites
only ~12,000 pages (at a depth of 3)
probably ~40,000 pages if we go to a depth of 4

We use key matches a lot. In solr terms these are elevated documents 
(elevations)

We would code a search query form in php and wrap it into our design 
(http://www.richmond.edu)

I have played with and love lucidworks and know that their $ solution works for 
our use cases but the cost model is not attractive for such a small collection.

So, with solr, what are my open source options, and what are people's experiences 
crawling and indexing web sites with solr plus a crawler? I understand solr does 
not ship with a crawler, so getting one working would be the first step.

We can code in Java, PHP, Python etc. if we have to, but we don't want to write 
a crawler if we can avoid it.

thanks in advance for any information.

--
Eric Palmer
Web Services
U of Richmond



RE: Replacing Google Mini Search Appliance with Solr?

2013-10-30 Thread Markus Jelsma
Hi Eric,

We have also helped a government institution replace their expensive GSA 
with open source software. In our case we use Apache Nutch 1.7 to crawl the 
websites and index to Apache Solr. It is very effective, robust and scales 
easily with Hadoop if you have to. Nutch may not be the easiest tool for the 
job but is very stable, feature rich and has an active community here at Apache.

Cheers,
 
-Original message-
 From:Palmer, Eric epal...@richmond.edu
 Sent: Wednesday 30th October 2013 18:48
 To: solr-user@lucene.apache.org
 Subject: Replacing Google Mini Search Appliance with Solr?
 
 Hello all,
 
 Been lurking on the list for awhile.
 
 We are at the end of life for replacing two google mini search appliances 
 used to index our public web sites. Google is no longer selling the mini 
 appliances and buying the big appliance is not cost beneficial.
 
 http://search.richmond.edu/
 
 We would run a solr replacement in linux (cents, redhat, similar) with open 
 Java or Oracle Java.
 
 Background
 ==
 ~130 sites
 only ~12,000 pages (at a depth of 3)
 probably ~40,000 pages if we go to a depth of 4
 
 We use key matches a lot. In solr terms these are elevated documents 
 (elevations)
 
 We would code a search query form in php and wrap it into our design 
 (http://www.richmond.edu)
 
 I have played with and love lucidworks and know that their $ solution works 
 for our use cases but the cost model is not attractive for such a small 
 collection.
 
 So with solr what are my open source options and what are people's 
 experiences crawling and indexing web sites with solr + crawler. I understand 
 there is not a crawler with solr so that would have to be first up to get one 
 working.
 
 We can code in Java, PHP, Python etc. if we have to, but we don't want to 
 write a crawler if we can avoid it.
 
 thanks in advance for and information.
 
 --
 Eric Palmer
 Web Services
 U of Richmond
 
 


Re: Replacing Google Mini Search Appliance with Solr?

2013-10-30 Thread Jason Hellman
Nutch is an excellent option.  It should feel very comfortable for people 
migrating away from the Google appliances.

Apache Droids is another possible way to approach, and I’ve found people using 
Heritrix or Manifold for various use cases (and usually in combination with 
other use cases where the extra overhead was worth the trouble).

I think the simplest approach will be Nutch…it’s absolutely worth taking a shot 
at it.

DO NOT write a crawler!  That is a rabbit hole you do not want to peer down 
into :)



On Oct 30, 2013, at 10:54 AM, Markus Jelsma markus.jel...@openindex.io wrote:

 Hi Eric,
 
 We have also helped some government institution to replave their expensive 
 GSA with open source software. In our case we use Apache Nutch 1.7 to crawl 
 the websites and index to Apache Solr. It is very effective, robust and 
 scales easily with Hadoop if you have to. Nutch may not be the easiest tool 
 for the job but is very stable, feature rich and has an active community here 
 at Apache.
 
 Cheers,
 
 -Original message-
 From:Palmer, Eric epal...@richmond.edu
 Sent: Wednesday 30th October 2013 18:48
 To: solr-user@lucene.apache.org
 Subject: Replacing Google Mini Search Appliance with Solr?
 
 Hello all,
 
 Been lurking on the list for awhile.
 
 We are at the end of life for replacing two google mini search appliances 
 used to index our public web sites. Google is no longer selling the mini 
 appliances and buying the big appliance is not cost beneficial.
 
 http://search.richmond.edu/
 
 We would run a solr replacement in linux (cents, redhat, similar) with open 
 Java or Oracle Java.
 
 Background
 ==
 ~130 sites
 only ~12,000 pages (at a depth of 3)
 probably ~40,000 pages if we go to a depth of 4
 
 We use key matches a lot. In solr terms these are elevated documents 
 (elevations)
 
 We would code a search query form in php and wrap it into our design 
 (http://www.richmond.edu)
 
 I have played with and love lucidworks and know that their $ solution works 
 for our use cases but the cost model is not attractive for such a small 
 collection.
 
 So with solr what are my open source options and what are people's 
 experiences crawling and indexing web sites with solr + crawler. I 
 understand there is not a crawler with solr so that would have to be first 
 up to get one working.
 
 We can code in Java, PHP, Python etc. if we have to, but we don't want to 
 write a crawler if we can avoid it.
 
 thanks in advance for and information.
 
 --
 Eric Palmer
 Web Services
 U of Richmond
 
 



Re: SolrCloud batch updates

2013-10-30 Thread Anshum Gupta
Hi Michael,

Here's a good post by Erick Erickson about understanding commits and
transaction logs in Solr.
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

About the replication, as soon as you post an update, here's what happens:
1. The update gets routed to the correct leader
2. The leader writes it to its transaction log
3. Leader forwards the updates to the replicas.
4. When the replicas respond in positive about the update being successful,
the leader returns a success message for the update.

Hope that helps.
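
For what it's worth, a minimal SolrJ sketch of the batching pattern being
discussed, leaving all commits to the server-side autoCommit/softCommit settings
(ZooKeeper hosts, collection name and batch size are placeholders, not values
from this thread):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // Connect through ZooKeeper so updates are routed to the correct leader
    CloudSolrServer server =
        new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
    server.setDefaultCollection("collection1");

    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < 100000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("text", "body of document " + i);
      batch.add(doc);
      if (batch.size() == 10000) {
        server.add(batch);   // send a 10k batch; no explicit commit here
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);     // flush the remainder
    }
    server.shutdown();
  }
}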


On Wed, Oct 30, 2013 at 9:06 PM, michael.boom my_sky...@yahoo.com wrote:

 I'm currently using a SolrCoud setup and I index my data using a couple of
 in-house indexing clients.
 The clients process some files and post json messages containing added
 documents in batches.
 Initially my batch size was 100k docs and the post request took about 20-30
 secs.
 I switched to 10k batches and now the updates are much faster but also more
 in number.

 My commit settings are :
 - autocommit - 45s / 100k docs, openSearcher=false
 - softAutoCommit - every 3 minutes

 I'm trying to figure out which one is preferable - bigger batches sent rarely, or
 smaller batches sent often? And why?
 Which are the background operations that take place after posting docs?
 At which point does the replication kick in - after commit or after update?






 -
 Thanks,
 Michael
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SolrCloud-batch-updates-tp4098463.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 

Anshum Gupta
http://www.anshumgupta.net


Re: Replacing Google Mini Search Appliance with Solr?

2013-10-30 Thread Palmer, Eric
Markus and Jason

thanks for the info.

I will start to research Nutch.  Writing a crawler, agree it is a rabbit
hole.


-- 
Eric Palmer

Web Services
U of Richmond

To report technical issues, obtain technical support or make requests for
enhancements please visit
http://web.richmond.edu/contact/technical-support.html





On 10/30/13 2:53 PM, Jason Hellman jhell...@innoventsolutions.com
wrote:

Nutch is an excellent option.  It should feel very comfortable for people
migrating away from the Google appliances.

Apache Droids is another possible way to approach, and I've found people
using Heritrix or Manifold for various use cases (and usually in
combination with other use cases where the extra overhead was worth the
trouble).

I think the simplest approach will be Nutch…it's absolutely worth taking a
shot at it.

DO NOT write a crawler!  That is a rabbit hole you do not want to peer
down into :)



On Oct 30, 2013, at 10:54 AM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Hi Eric,
 
 We have also helped some government institution to replave their
expensive GSA with open source software. In our case we use Apache Nutch
1.7 to crawl the websites and index to Apache Solr. It is very
effective, robust and scales easily with Hadoop if you have to. Nutch
may not be the easiest tool for the job but is very stable, feature rich
and has an active community here at Apache.
 
 Cheers,
 
 -Original message-
 From:Palmer, Eric epal...@richmond.edu
 Sent: Wednesday 30th October 2013 18:48
 To: solr-user@lucene.apache.org
 Subject: Replacing Google Mini Search Appliance with Solr?
 
 Hello all,
 
 Been lurking on the list for awhile.
 
 We are at the end of life for replacing two google mini search
appliances used to index our public web sites. Google is no longer
selling the mini appliances and buying the big appliance is not cost
beneficial.
 
 http://search.richmond.edu/
 
 We would run a solr replacement in linux (cents, redhat, similar) with
open Java or Oracle Java.
 
 Background
 ==
 ~130 sites
 only ~12,000 pages (at a depth of 3)
 probably ~40,000 pages if we go to a depth of 4
 
 We use key matches a lot. In solr terms these are elevated documents
(elevations)
 
 We would code a search query form in php and wrap it into our design
(http://www.richmond.edu)
 
 I have played with and love lucidworks and know that their $ solution
works for our use cases but the cost model is not attractive for such a
small collection.
 
 So with solr what are my open source options and what are people's
experiences crawling and indexing web sites with solr + crawler. I
understand there is not a crawler with solr so that would have to be
first up to get one working.
 
 We can code in Java, PHP, Python etc. if we have to, but we don't want
to write a crawler if we can avoid it.
 
 thanks in advance for and information.
 
 --
 Eric Palmer
 Web Services
 U of Richmond
 
 




Re: [SolrCloud-Solrj] Document router problem connecting to Zookeeper ensemble

2013-10-30 Thread Anshum Gupta
Hi Alessandro,

What version of Solr are you running and what's the version of SolrJ? I am
guessing they are different.
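
If you build with Maven, a sketch of keeping the client in step with the server
(the version shown is only an example and should match your Solr install):

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>4.5.1</version>
</dependency>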




On Wed, Oct 30, 2013 at 8:32 PM, Alessandro Benedetti 
benedetti.ale...@gmail.com wrote:

 I have a zookeeper ensemble hosted on one amazon server.
 Using the CloudSolrServer and trying to connect, I obtain this really
 unusual error:

 969 [main] INFO org.apache.solr.common.cloud.ConnectionManager - Client is
 connected to ZooKeeper
 1043 [main] INFO org.apache.solr.common.cloud.ZkStateReader - Updating
 cluster state from ZooKeeper...
 Exception in thread main org.apache.solr.common.SolrException: Unknown
 document router '{name=implicit}'
 at org.apache.solr.common.cloud.DocRouter.getDocRouter(DocRouter.java:46)

 Although in my collection I have the compositeId strategy for routing (
 from the clusterState.json ) .

 This is how I instantiate the server :

 CloudSolrServer server;
 server = new CloudSolrServer(
     "ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2181," +
     "ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2182," +
     "ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2183");
 server.setDefaultCollection("example");
 SolrPingResponse ping = server.ping();

 Any hint ?
 --
 --

 Benedetti Alessandro
 Visiting card : http://about.me/alessandro_benedetti

 Tyger, tyger burning bright
 In the forests of the night,
 What immortal hand or eye
 Could frame thy fearful symmetry?

 William Blake - Songs of Experience -1794 England




-- 

Anshum Gupta
http://www.anshumgupta.net


AJAX Solr returning the default wildcard *:* and not what I query

2013-10-30 Thread Reyes, Mark
I am currently integrating the JavaScript framework AJAX Solr into my domain. I am 
trying to query words such as 'doctorate' or 'programs', but the console is 
reporting only '*:*', the default wildcard.

Just curious if anyone has any helpful hints? The problem can be seen in detail 
on Stackoverflow,
http://stackoverflow.com/questions/19691535/ajax-solr-returning-the-default-wildcard-and-not-what-i-query

Thank you,
Mark

IMPORTANT NOTICE: This e-mail message is intended to be received only by 
persons entitled to receive the confidential information it may contain. E-mail 
messages sent from Bridgepoint Education may contain information that is 
confidential and may be legally privileged. Please do not read, copy, forward 
or store this message unless you are an intended recipient of it. If you 
received this transmission in error, please notify the sender by reply e-mail 
and delete the message and any attachments.

Re: Evaluating a SOLR index with trec_eval

2013-10-30 Thread Tom Burton-West
Hi Michael,

I know you are asking about Solr, but in case you haven't seen it, Ian
Soboroff has a nice little demo for Lucene:

https://github.com/isoboroff/trec-demo.

There is also the lucene benchmark code:
http://lucene.apache.org/core/4_5_1/benchmark/org/apache/lucene/benchmark/quality/package-summary.html

Otherwise, all I can think of is writing an app layer that keeps track of
the id, sends the query to Solr, parses the search results and spits out
results in the trec format.  I'd love to find some open-source code that
does what you ask.
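
A rough sketch of such an app layer with SolrJ, printing the usual trec_eval
result lines (qid Q0 docid rank score runtag); the Solr URL, field names and run
tag are only placeholders:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class TrecRunner {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    // topics file: one query per line, formatted as "<queryId> <query text>"
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\\s+", 2);
      String qid = parts[0];
      SolrQuery q = new SolrQuery(parts[1]);
      q.setRows(1000);
      q.setFields("id", "score");
      QueryResponse rsp = solr.query(q);
      int rank = 1;
      for (SolrDocument doc : rsp.getResults()) {
        // trec_eval result line: qid Q0 docid rank score runtag
        System.out.println(qid + " Q0 " + doc.getFieldValue("id") + " "
            + rank + " " + doc.getFieldValue("score") + " solr-run");
        rank++;
      }
    }
    in.close();
    solr.shutdown();
  }
}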

I did a quick and dirty version of something like that for the INEX book
track.  I'll see if I can find the code and if it is in any shape to share.

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search


On Wed, Oct 30, 2013 at 10:52 AM, Michael Preminger 
michael.premin...@hioa.no wrote:

 Hello!

 Is there a simple way to evaluate a SOLR index with TREC_EVAL?
 I mean:
 *  preparing a query file in some format Solr will understand, but where
 each query has an ID
 * getting results out in trec format, with these query IDs attached

 Thanks

 Michael



SV: Evaluating a SOLR index with trec_eval

2013-10-30 Thread Michael Preminger
Hi, Tom!
Thanks a lot. I'll check Ian's stuff and anticipate yours ...

As you know, the ProveIt is now terminated as an INEX track, but we still hope 
to write a paper to a journal, summarizing what was done, and it would be nice 
to have you on.

AND, you'll be happy (or shocked) to know that this week I used your INEX paper 
from 2011 as an example 
of practice-near research in a seminar I was running for them, and they had 
an assignment to write a reflection note in advance where they associate your 
stuff with their own assignment.

Michael

Fra: Tom Burton-West [tburt...@umich.edu]
Sendt: 30. oktober 2013 20:26
To: solr-user@lucene.apache.org
Emne: Re: Evaluating a SOLR index with trec_eval

Hi Michael,

I know you are asking about Solr, but in case you haven't seen it, Ian
Soboroff has a nice little demo for Lucene:

https://github.com/isoboroff/trec-demo.

There is also the lucene benchmark code:
http://lucene.apache.org/core/4_5_1/benchmark/org/apache/lucene/benchmark/quality/package-summary.html

Otherwise, all I can think of is writing an app layer that keeps track of
the id, sends the query to Solr, parses the search results and spits out
results in the trec format.  I'd love to find some open-source code that
does what you ask.

I did a quick and dirty version of something like that for the INEX book
track.  I'll see if I can find the code and if it is in any shape to share.

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Sevice
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search


On Wed, Oct 30, 2013 at 10:52 AM, Michael Preminger 
michael.premin...@hioa.no wrote:

 Hello!

 Is there a simple way to evaluate a SOLR index with TREC_EVAL?
 I mean:
 *  preparing a query file in some format Solr will understand, but where
 each query has an ID
 * getting results out in trec format, with these query IDs attached

 Thanks

 Michael



SV: Evaluating a SOLR index with trec_eval

2013-10-30 Thread Michael Preminger
... AND apologies to everyone for erroneously posting irrelevant stuff on the 
list.

Michael

Fra: Michael Preminger [michael.premin...@hioa.no]
Sendt: 30. oktober 2013 20:34
To: solr-user@lucene.apache.org
Emne: SV: Evaluating a SOLR index with trec_eval

Hi, Tom!
Thanks alot. Ill check Ian's stuff and anticpate yours ...

As you know, the ProveIt is now terminated as an INEX track, but we still hope 
to write a paper to a journal, summarizing what was done, and it would be nice 
to have you on.

AND, youll be happy (or shocked) to know that this week I used your INEX paper 
from 2011 as an example
of practice-near research in a seminar I was running for them, and they had 
an assignment to right a reflection note in advance where they associate your 
stuff with their own assignment.

Michael

Fra: Tom Burton-West [tburt...@umich.edu]
Sendt: 30. oktober 2013 20:26
To: solr-user@lucene.apache.org
Emne: Re: Evaluating a SOLR index with trec_eval

Hi Michael,

I know you are asking about Solr, but in case you haven't seen it, Ian
Soboroff has a nice little demo for Lucene:

https://github.com/isoboroff/trec-demo.

There is also the lucene benchmark code:
http://lucene.apache.org/core/4_5_1/benchmark/org/apache/lucene/benchmark/quality/package-summary.html

Otherwise, all I can think of is writing an app layer that keeps track of
the id, sends the query to Solr, parses the search results and spits out
results in the trec format.  I'd love to find some open-source code that
does what you ask.

I did a quick and dirty version of something like that for the INEX book
track.  I'll see if I can find the code and if it is in any shape to share.

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Sevice
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search


On Wed, Oct 30, 2013 at 10:52 AM, Michael Preminger 
michael.premin...@hioa.no wrote:

 Hello!

 Is there a simple way to evaluate a SOLR index with TREC_EVAL?
 I mean:
 *  preparing a query file in some format Solr will understand, but where
 each query has an ID
 * getting results out in trec format, with these query IDs attached

 Thanks

 Michael



Re: AJAX Solr returning the default wildcard *:* and not what I query

2013-10-30 Thread Shawn Heisey

On 10/30/2013 1:26 PM, Reyes, Mark wrote:

I am currently integrating JavaScript framework AJAX Solr to my domain. I am 
trying to query words such as 'doctorate' or 'programs' but the console is 
reporting '*:*' only the default wildcard.

Just curious if anyone has any helpful hints? The problem can be seen in detail 
on Stackoverflow,
http://stackoverflow.com/questions/19691535/ajax-solr-returning-the-default-wildcard-and-not-what-i-query


We would have to know what Solr is actually receiving from your app. The 
Solr log should have an entry for every query you do, and it includes 
all of the parameters for that query.  This is *not* the Logging tab in 
the admin UI, but the actual logfile.  On Solr 4.3 and later with the 
example logging setup, this is typically $CWD/logs/solr.log.


Thanks,
Shawn



ReplicationHandler - SnapPull failed to download a file completely.

2013-10-30 Thread Shalom Ben-Zvi Kazaz
We are continuously getting this exception during replication from
master to slave. Our index size is 9.27 GB and we are trying to replicate
a slave from scratch.
It's a different file each time; sometimes we get to 60% replication
before it fails and sometimes only 10%. We never managed a successful
replication.

30 Oct 2013 18:38:52,884 [explicit-fetchindex-cmd] ERROR
ReplicationHandler - SnapPull failed
:org.apache.solr.common.SolrException: Unable to download
_aa7_Lucene41_0.tim completely. Downloaded 0!=1054090
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1244)
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1124)
at
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:719)
at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:397)
at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317)
at
org.apache.solr.handler.ReplicationHandler$1.run(ReplicationHandler.java:218)

I read in some thread that there was a related bug in solr 4.1, but we
are using solr 4.3 and also tried 4.5.1.
It seems that DirectoryFileFetcher sometimes cannot download a file;
the file is downloaded to the slave with size zero.
We are running in a test environment where bandwidth is high.

this is the master setup:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">stopwords.txt,spellings.txt,synonyms.txt,protwords.txt,elevate.xml,currency.xml</str>
    <str name="commitReserveDuration">00:00:50</str>
  </lst>
</requestHandler>

and the slave setup:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://solr-master.saltdev.sealdoc.com:8081/solr-master</str>
    <str name="httpConnTimeout">15</str>
    <str name="httpReadTimeout">30</str>
  </lst>
</requestHandler>



Re: AJAX Solr returning the default wildcard *:* and not what I query

2013-10-30 Thread Reyes, Mark
solr.log file per Solr 4.5

http://pastebin.com/zSpERJZA


Thanks Shawn,
Mark



On 10/30/13, 12:44 PM, Shawn Heisey s...@elyograg.org wrote:

On 10/30/2013 1:26 PM, Reyes, Mark wrote:
 I am currently integrating JavaScript framework AJAX Solr to my domain.
I am trying to query words such as 'doctorate' or 'programs' but the
console is reporting '*:*' only the default wildcard.

 Just curious if anyone has any helpful hints? The problem can be seen
in detail on Stackoverflow,
 
http://stackoverflow.com/questions/19691535/ajax-solr-returning-the-defau
lt-wildcard-and-not-what-i-query

We would have to know what Solr is actually receiving from your app. The
Solr log should have an entry for every query you do, and it includes
all of the parameters for that quey.  This is *not* the Logging tab in
the admin UI, but the actual logfile.  On Solr 4.3 and later with the
example logging setup, this is typically $CWD/logs/solr.log.

Thanks,
Shawn



IMPORTANT NOTICE: This e-mail message is intended to be received only by 
persons entitled to receive the confidential information it may contain. E-mail 
messages sent from Bridgepoint Education may contain information that is 
confidential and may be legally privileged. Please do not read, copy, forward 
or store this message unless you are an intended recipient of it. If you 
received this transmission in error, please notify the sender by reply e-mail 
and delete the message and any attachments.

Re: ReplicationHandler - SnapPull failed to download a file completely.

2013-10-30 Thread Shawn Heisey

On 10/30/2013 1:49 PM, Shalom Ben-Zvi Kazaz wrote:

we are continuously getting this exception during replication from
master to slave. our index size is 9.27 G and we are trying to replicate
a slave from scratch.
Its a different file each time , sometimes we get to 60% replication
before it fails and sometimes only 10%, we never managed a successful
replication.


snip


this is the master setup:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">stopwords.txt,spellings.txt,synonyms.txt,protwords.txt,elevate.xml,currency.xml</str>
    <str name="commitReserveDuration">00:00:50</str>
  </lst>
</requestHandler>


I assume that you're probably doing commits fairly often, resulting in a 
lot of merge activity that frequently deletes segments.  That 
commitReserveDuration parameter needs to be made larger.  I would 
imagine that it takes a lot more than 50 seconds to do the replication - 
even if you've got an extremely fast network, replicating 9.7GB probably 
takes several minutes.


From the wiki page on replication:  If your commits are very frequent 
and network is particularly slow, you can tweak an extra attribute 
str name=commitReserveDuration00:00:10/str. This is roughly the 
time taken to download 5MB from master to slave. Default is 10 secs.


http://wiki.apache.org/solr/SolrReplication#Master

You've said that your network is not slow, but with that much data, all 
networks are slow.
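
As a sketch, the master section with a more generous reservation (ten minutes
here, purely as an example; size it to how long a full pull actually takes):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">stopwords.txt,spellings.txt,synonyms.txt,protwords.txt,elevate.xml,currency.xml</str>
    <str name="commitReserveDuration">00:10:00</str>
  </lst>
</requestHandler>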


Thanks,
Shawn



Re: AJAX Solr returning the default wildcard *:* and not what I query

2013-10-30 Thread Shawn Heisey

On 10/30/2013 1:55 PM, Reyes, Mark wrote:

solr.log file per Solr 4.5

http://pastebin.com/zSpERJZA


Your queries all look like the following, with different numbers for the 
parameters json.wrf and _ (underscore) that I've never seen before, and 
I assume Solr just ignores.


{json.wrf=jQuery171015135826403275132_1383154109139q=*:*_=1383154109332wt=json}

Those query parameters include q=*:*, so Solr is returning what it was 
asked for.  You'll need to figure out why your ajax code is not sending 
q=doctorate or q=programs instead.


Thanks,
Shawn



Re: Replacing Google Mini Search Appliance with Solr?

2013-10-30 Thread Rajani Maski
Hi Eric,

  I have also developed mini-applications replacing GSA for some of our
clients using Apache Nutch + Solr to crawl multilingual sites and enable
multilingual search. Nutch+Solr is very stable and the Nutch mailing list
provides good support.

Reference link to start:
https://sites.google.com/site/profilerajanimaski/webcrawlers/apache-nutch

Thanks
Rajani




On Thu, Oct 31, 2013 at 12:27 AM, Palmer, Eric epal...@richmond.edu wrote:

 Markus and Jason

 thanks for the info.

 I will start to research Nutch.  Writing a crawler, agree it is a rabbit
 hole.


 --
 Eric Palmer

 Web Services
 U of Richmond

 To report technical issues, obtain technical support or make requests for
 enhancements please visit
 http://web.richmond.edu/contact/technical-support.html





 On 10/30/13 2:53 PM, Jason Hellman jhell...@innoventsolutions.com
 wrote:

 Nutch is an excellent option.  It should feel very comfortable for people
 migrating away from the Google appliances.
 
  Apache Droids is another possible way to approach, and I've found people
  using Heritrix or Manifold for various use cases (and usually in
 combination with other use cases where the extra overhead was worth the
 trouble).
 
  I think the simplest approach will be Nutch…it's absolutely worth taking a
 shot at it.
 
 DO NOT write a crawler!  That is a rabbit hole you do not want to peer
 down into :)
 
 
 
 On Oct 30, 2013, at 10:54 AM, Markus Jelsma markus.jel...@openindex.io
 wrote:
 
  Hi Eric,
 
  We have also helped some government institution to replave their
 expensive GSA with open source software. In our case we use Apache Nutch
 1.7 to crawl the websites and index to Apache Solr. It is very
 effective, robust and scales easily with Hadoop if you have to. Nutch
 may not be the easiest tool for the job but is very stable, feature rich
 and has an active community here at Apache.
 
  Cheers,
 
  -Original message-
  From:Palmer, Eric epal...@richmond.edu
  Sent: Wednesday 30th October 2013 18:48
  To: solr-user@lucene.apache.org
  Subject: Replacing Google Mini Search Appliance with Solr?
 
  Hello all,
 
  Been lurking on the list for awhile.
 
  We are at the end of life for replacing two google mini search
 appliances used to index our public web sites. Google is no longer
 selling the mini appliances and buying the big appliance is not cost
 beneficial.
 
  http://search.richmond.edu/
 
  We would run a solr replacement in linux (cents, redhat, similar) with
 open Java or Oracle Java.
 
  Background
  ==
  ~130 sites
  only ~12,000 pages (at a depth of 3)
  probably ~40,000 pages if we go to a depth of 4
 
  We use key matches a lot. In solr terms these are elevated documents
 (elevations)
 
  We would code a search query form in php and wrap it into our design
 (http://www.richmond.edu)
 
  I have played with and love lucidworks and know that their $ solution
 works for our use cases but the cost model is not attractive for such a
 small collection.
 
  So with solr what are my open source options and what are people's
 experiences crawling and indexing web sites with solr + crawler. I
 understand there is not a crawler with solr so that would have to be
 first up to get one working.
 
  We can code in Java, PHP, Python etc. if we have to, but we don't want
 to write a crawler if we can avoid it.
 
  thanks in advance for and information.
 
  --
  Eric Palmer
  Web Services
  U of Richmond
 
 
 




Re: AJAX Solr returning the default wildcard *:* and not what I query

2013-10-30 Thread Anshum Gupta
As Shawn pointed out, it seems like your client is actually sending out *:*
queries every time.
You perhaps have the wrong id for the search box, or something that results
in your ajax library never actually receiving the actual input value, but
I'm just guessing.



On Thu, Oct 31, 2013 at 1:25 AM, Reyes, Mark mark.re...@bpiedu.com wrote:

 solr.log file per Solr 4.5

 http://pastebin.com/zSpERJZA


 Thanks Shawn,
 Mark



 On 10/30/13, 12:44 PM, Shawn Heisey s...@elyograg.org wrote:

 On 10/30/2013 1:26 PM, Reyes, Mark wrote:
  I am currently integrating JavaScript framework AJAX Solr to my domain.
 I am trying to query words such as 'doctorate' or 'programs' but the
 console is reporting '*:*' only the default wildcard.
 
  Just curious if anyone has any helpful hints? The problem can be seen
 in detail on Stackoverflow,
 
 
 http://stackoverflow.com/questions/19691535/ajax-solr-returning-the-defau
 lt-wildcard-and-not-what-i-query
 
 We would have to know what Solr is actually receiving from your app. The
 Solr log should have an entry for every query you do, and it includes
 all of the parameters for that quey.  This is *not* the Logging tab in
 the admin UI, but the actual logfile.  On Solr 4.3 and later with the
 example logging setup, this is typically $CWD/logs/solr.log.
 
 Thanks,
 Shawn
 


 IMPORTANT NOTICE: This e-mail message is intended to be received only by
 persons entitled to receive the confidential information it may contain.
 E-mail messages sent from Bridgepoint Education may contain information
 that is confidential and may be legally privileged. Please do not read,
 copy, forward or store this message unless you are an intended recipient of
 it. If you received this transmission in error, please notify the sender by
 reply e-mail and delete the message and any attachments.




-- 

Anshum Gupta
http://www.anshumgupta.net


Re: Problem querying with edismax and hyphens

2013-10-30 Thread Michael Nilsson
I too have come across this same exact problem.  One thing that I have
found is that with autoGeneratePhraseQueries=true, you can find the case
where your index has 'z score' and your query is z-score, but with false it
will not find it.  As to your specific problem with the single token zscore
in the index and z-score as the query, I'm still stumped. Hopefully someone
else can answer this question?


On Wed, Oct 30, 2013 at 11:56 AM, Vardhan Dharnidharka 
vardhan1...@hotmail.com wrote:

 Hi,

 The query z-score doesn't match a doc with zscore in the index. The
 analysis tool shows that this query would match this data in the index, but
 it's the edismax query parser step that seems to screw things up. Is there
 some combination of autoGeneratePhraseQueries, WordDelimiterFilterFactory
 parameters, and/or something else I can change or add to generically make
 the query match without modifying the mm? ie. without adding a rule to
 specifically synonymize or split the term zscore with some dictionary of
 words.

 The query I want to match but doesn't:
 z-score
 mm=-30%

 In the index:
 zscore

 The analyzer:

 <fieldType autoGeneratePhraseQueries="false" class="solr.TextField"
            name="lowStopText" positionIncrementGap="100">

   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter catenateAll="1" catenateNumbers="1" catenateWords="1"
             class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
             splitOnCaseChange="0" splitOnNumerics="0" types="wdfftypes.txt"/>
     <filter class="solr.ICUFoldingFilterFactory"/>
   </analyzer>

   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter catenateAll="1" catenateNumbers="1" catenateWords="1"
             class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
             splitOnCaseChange="0" splitOnNumerics="0" types="wdfftypes.txt"/>
     <filter class="solr.ICUFoldingFilterFactory"/>
     <filter class="solr.StopFilterFactory" enablePositionIncrements="true"
             ignoreCase="true" words="stopwords.txt"/>
   </analyzer>
 </fieldType>

 The parsed edismax query with autoGeneratePhraseQueries=true:
 +(def_term:\(z-score z) (score zscore)\)

 The parsed edismax query with autoGeneratePhraseQueries=false:
 +(((def_term:z-score def_term:z def_term:score def_term:zscore)~3))

 Thanks
 Vardhan



Re: Replacing Google Mini Search Appliance with Solr?

2013-10-30 Thread Palmer, Eric
Thanks for the link

Sent from my iPhone

On Oct 30, 2013, at 4:06 PM, Rajani Maski rajinima...@gmail.com wrote:

 Hi Eric,
 
  I have also developed mini-applications replacing GSA for some of our
 clients using Apache Nutch + Solr to crawl multi lingual sites and enable
 multi-lingual search. Nutch+Solr is very stable and Nutch mailing list
 provides a good support.
 
 Reference link to start:
 https://sites.google.com/site/profilerajanimaski/webcrawlers/apache-nutch
 
 Thanks
 Rajani
 
 
 
 
 On Thu, Oct 31, 2013 at 12:27 AM, Palmer, Eric epal...@richmond.edu wrote:
 
 Markus and Jason
 
 thanks for the info.
 
 I will start to research Nutch.  Writing a crawler, agree it is a rabbit
 hole.
 
 
 --
 Eric Palmer
 
 Web Services
 U of Richmond
 
 To report technical issues, obtain technical support or make requests for
 enhancements please visit
 http://web.richmond.edu/contact/technical-support.html
 
 
 
 
 
 On 10/30/13 2:53 PM, Jason Hellman jhell...@innoventsolutions.com
 wrote:
 
 Nutch is an excellent option.  It should feel very comfortable for people
 migrating away from the Google appliances.
 
  Apache Droids is another possible way to approach, and I've found people
  using Heritrix or Manifold for various use cases (and usually in
 combination with other use cases where the extra overhead was worth the
 trouble).
 
  I think the simplest approach will be Nutch…it's absolutely worth taking a
 shot at it.
 
 DO NOT write a crawler!  That is a rabbit hole you do not want to peer
 down into :)
 
 
 
 On Oct 30, 2013, at 10:54 AM, Markus Jelsma markus.jel...@openindex.io
 wrote:
 
 Hi Eric,
 
 We have also helped some government institution to replave their
 expensive GSA with open source software. In our case we use Apache Nutch
 1.7 to crawl the websites and index to Apache Solr. It is very
 effective, robust and scales easily with Hadoop if you have to. Nutch
 may not be the easiest tool for the job but is very stable, feature rich
 and has an active community here at Apache.
 
 Cheers,
 
 -Original message-
 From:Palmer, Eric epal...@richmond.edu
 Sent: Wednesday 30th October 2013 18:48
 To: solr-user@lucene.apache.org
 Subject: Replacing Google Mini Search Appliance with Solr?
 
 Hello all,
 
 Been lurking on the list for awhile.
 
 We are at the end of life for replacing two google mini search
 appliances used to index our public web sites. Google is no longer
 selling the mini appliances and buying the big appliance is not cost
 beneficial.
 
 http://search.richmond.edu/
 
 We would run a solr replacement in linux (cents, redhat, similar) with
 open Java or Oracle Java.
 
 Background
 ==
 ~130 sites
 only ~12,000 pages (at a depth of 3)
 probably ~40,000 pages if we go to a depth of 4
 
 We use key matches a lot. In solr terms these are elevated documents
 (elevations)
 
 We would code a search query form in php and wrap it into our design
 (http://www.richmond.edu)
 
 I have played with and love lucidworks and know that their $ solution
 works for our use cases but the cost model is not attractive for such a
 small collection.
 
 So with solr what are my open source options and what are people's
 experiences crawling and indexing web sites with solr + crawler. I
 understand there is not a crawler with solr so that would have to be
 first up to get one working.
 
 We can code in Java, PHP, Python etc. if we have to, but we don't want
 to write a crawler if we can avoid it.
 
 thanks in advance for and information.
 
 --
 Eric Palmer
 Web Services
 U of Richmond
 
 


Re: Computing Results So That They are Returned in Search Results

2013-10-30 Thread Upayavira
Also note that function queries only return numbers (given their origin
in scoring). They cannot be used to create virtual string or text
fields.
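For example, fl=id,name,margin:sub(price,cost) returns a numeric pseudo-field
named margin (the field names here are only placeholders), but there is no
equivalent for producing a string-valued field this way.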

Upayavira

On Wed, Oct 30, 2013, at 05:19 PM, Jack Krupansky wrote:
 A function query is simply returning a calculated result based on
 existing 
 data - no new fields required.
 
 Did you actually want to precompute a value, store it in the index, and
 then 
 query on it? If so, you could do that indexing with a custom or scripted 
 update processor.
 
 Flesh out an example of exactly what you want.
 
 -- Jack Krupansky
 
 -Original Message- 
 From: Alejandro Calbazana
 Sent: Wednesday, October 30, 2013 12:46 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Computing Results So That They are Returned in Search
 Results
 
 Sounds really close to what I'm looking for, but this sounds like it
 would
 result in a new field on a document (or a new value for a field defined
 to
 hold the result of a function).  Would it be possible for a function
 query
 to produce a new document so that I can associate the computed value with
 it?
 
 Thanks,
 
 Alejandro
 
 
 On Wed, Oct 30, 2013 at 12:05 PM, Jack Krupansky 
 j...@basetechnology.comwrote:
 
  You could create a custom value source and then use it in a function
  query embedded in your return fields list (fl).
 
  So, the function query could use a function (value source) that takes a
  field, fetches its value, performs some arbitrary calculation, and then
  returns that value.
 
  fl=id,name,my-func(field1),my-func(field2)
 
  -- Jack Krupansky
 
  -Original Message- From: Alejandro Calbazana
  Sent: Wednesday, October 30, 2013 10:10 AM
  To: solr-user@lucene.apache.org
  Subject: Computing Results So That They are Returned in Search Results
 
  I'd like to throw out a design question and see if its possible to solve
  this with Solr.
 
  I have a set of data that is computed that I'd like to make searchable.
  Ideally, I'd like to have all documents indexed and call it the day, but
  the nature of the data is such that it needs to be computed given a
  definition.  I'm interested in searching on definitions and then creating
  results on the fly that are calculated based on something embedded in the
  definition.
 
  Is it possible to embed this calculation login into Solr's result handling
  process?  I know this sounds exotic, but the nature of the data is such
  that I can't index these calculated documents because I don't know what 
  the
  boundary is and specifiying an arbitrary number isn't ideal.
 
  Has anyone run across something like this?
 
  Thanks,
 
  Alejandr
  
 


Re: Unable to add mahout classifier

2013-10-30 Thread Koji Sekiguchi

(13/10/30 22:09), lovely kasi wrote:

Hi,

I made few changes to the solrconfig.xml, created a jar file,added it to
the lib folder of the solr and tried to start it.

THe changes in the solrconfig.xml are

<updateRequestProcessorChain name="mahoutclassifier" default="true">
  <processor class="com.mahout.solr.classifier.CategorizeDocumentFac">
    <str name="inputField">LEAD_NOTES</str>
    <str name="outputField">category</str>
    <str name="defaultCategory">Others</str>
    <str name="model">naiveBayesModel</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
</updateRequestProcessorChain>


What is com.mahout.solr.classifier.CategorizeDocumentFac ?
Is it a classifier delivered by the Solr community?

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: Need idea to standardize keywords - ring tone vs ringtone

2013-10-30 Thread Developer
I tried using synonyms, but they don't actually change the stored text, just
the indexed value.

I need a way to change the raw value stored in SOLR. Maybe I should use a
custom update processor to standardize the data.
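
For example, a sketch with the stock RegexReplaceProcessorFactory (Solr 4.x),
where the chain name, field name and pattern are just placeholders for whatever
rewrites are needed:

<updateRequestProcessorChain name="standardize">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">title</str>
    <str name="pattern">ring\s+tone</str>
    <str name="replacement">ringtone</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>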



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Need-idea-to-standardize-keywords-ring-tone-vs-ringtone-tp4097794p4098530.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Return the synonyms as part of Solr response

2013-10-30 Thread Koji Sekiguchi

Hi Siva,

(13/10/30 18:12), sivaprasad wrote:

Hi,
We have a requirement where we need to send the matched synonyms as part of
the Solr response.


I don't think that Solr has such a function.


Do we need to customize the Solr response handler to do this?


So the answer is yes.
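As a rough illustration of where to hook in: a custom SearchComponent can run
the query string back through a field's query-time analyzer (whose chain
contains the SynonymFilter) and add the produced terms to the response. This
is an untested sketch against the Solr 4.x APIs -- the class name, the "text"
field and the "expandedTerms" key are assumptions, and it returns all analyzed
terms, not only the injected synonyms:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class ExpandedTermsComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // nothing to prepare
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    String q = rb.getQueryString();
    if (q == null) {
      return;
    }
    // analyze the raw query string with the query-time analyzer of the field
    // whose chain contains the SynonymFilterFactory ("text" is an assumption)
    Analyzer analyzer = rb.req.getSchema().getFieldType("text").getQueryAnalyzer();
    List<String> terms = new ArrayList<String>();
    TokenStream ts = analyzer.tokenStream("text", new StringReader(q));
    try {
      CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        terms.add(termAtt.toString());   // includes synonym-injected terms
      }
      ts.end();
    } finally {
      ts.close();
    }
    rb.rsp.add("expandedTerms", terms);  // extra section in the response
  }

  @Override
  public String getDescription() {
    return "Adds analyzed/expanded query terms to the response";
  }

  @Override
  public String getSource() {
    return null;
  }
}

It would be registered with
<searchComponent name="expandedTerms" class="com.example.ExpandedTermsComponent"/>
and appended to the request handler's last-components. Isolating only the
synonym-injected terms takes more work, e.g. comparing against an analyzer
chain without the SynonymFilter.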

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread Walter Underwood
A flat distribution of queries is a poor test. Real queries have a Zipf
distribution. The flat distribution will get almost no benefit from caching, so
it will give numbers that are too low and stress disk IO too much. The 99th
percentile is probably the same for both distributions, because that is
dominated by rare queries.

Real query loads will get a much smaller boost from SSD in the median and up to 
about 75th percentile.
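As an example, a load driver can draw its terms with Zipf weights instead of
uniformly -- a small self-contained sketch; the dictionary file, the exponent
and the query count are arbitrary, and real rank/frequency data would come
from query logs:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class ZipfQuerySampler {
    public static void main(String[] args) throws Exception {
        // one term per line; rank = line number (real ranks would come from logs)
        List<String> terms =
            Files.readAllLines(Paths.get("dictionary.txt"), StandardCharsets.UTF_8);
        double s = 1.1;                              // Zipf exponent (arbitrary)
        double[] cumulative = new double[terms.size()];
        double sum = 0;
        for (int rank = 1; rank <= terms.size(); rank++) {
            sum += 1.0 / Math.pow(rank, s);          // weight of the term at this rank
            cumulative[rank - 1] = sum;
        }
        Random rnd = new Random();
        for (int i = 0; i < 1000; i++) {             // emit 1000 single-term test queries
            double r = rnd.nextDouble() * sum;
            int idx = Arrays.binarySearch(cumulative, r);
            if (idx < 0) idx = -idx - 1;             // first rank whose cumulative weight exceeds r
            System.out.println(terms.get(idx));
        }
    }
}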

wunder
Search guy for Netflix and now Chegg

On Oct 30, 2013, at 1:43 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

 On Tue, 2013-10-29 at 14:24 +0100, eShard wrote:
 I have a 1 TB repository with approximately 500,000 documents (that will
 probably grow from there) that needs to be indexed.  
 
 As Shawn points out, that isn't telling us much. If you describe the
 documents, how and how often you index and how you query them, it will
 help a lot.
 
 
 Let me offer some observations from a related project we are starting at
 Statsbiblioteket.
 
 
 We are planning to index 20 TB of harvested web resources (*.dk from the
 last 8 years, or at least the resources our crawlers sunk their
 tentacles into). We have two text indexes generated from about 1% and 2%
 of that corpus, respectively. They are 200GB and 420GB in size and
 contain ~75 million and (whoops, offline, so remembering/guessing here)
 ~150 million documents.
 
 For testing purposes we issued simple searches: 2-4 OR'ed terms, picked
 at random from a Danish dictionary. One of our test machines is a 2*8
 core Xeon machine with 32GB of RAM (about ~12GB free for caching) and
 SSD as storage. We had room for a 2-shard cloud on the SSDs, so
 searches were issued against a 2*200GB index holding a total of 150 million
 documents. CentOS/Solr 4.3.
 
 Hammering that machine with 32 threads gave us a median response time of
 200ms and a 99th percentile of 5-800 ms (depending on the test run); a
 single thread had a median of 30ms and a 99th percentile of 70-130ms. CPU
 load peaked at 300-400% and IOWait at 30-40%, but neither was closely monitored.
 
 Our current vision is to shard the projected 20TB index into ~800GB or
 ~1TB chunks (depending on which drives we choose) and put one shard on
 each physical SSD, thereby sidestepping the whole RAID & TRIM problem.
 
 We do have the great luxury of running nightly batch index updates on a
 single shard instead of continuous updates. We would probably go for
 smaller shards if they were all updated continuously.
 
 Projected prices for the full setup range from $50,000 to $100,000,
 depending on where we land on the off-the-shelf to enterprise scale.
 
 (I need to write a blog post on this)
 
 
 With that in mind, I urge you to do some testing on a machine with SSDs
 and modest memory vs. a machine with traditional spinning drives and
 monster memory.
 
 
 - Toke Eskildsen, State and University Library, Denmark






Re: Computing Results So That They are Returned in Search Results

2013-10-30 Thread Alejandro Calbazana
So here is my use case with a little more detail.  I'm working with
recurring events.  Each event has an expression associated with it that
defines its recurrence pattern.  For example, monthly, daily, yearly...
The event has metadata associated with it that is searchable.  When a user
performs a search, they can match on various metadata fields, but the query
can also span a range of dates.  If a match occurs, I'd like to unwind the
expression into the instances specified by the pattern and return these
virtual instances as results.

Right now, I'm post-processing data to hammer out the results that fit the
window of time specified in the query, but this moves sorting and
pagination out of the Solr tier. I'd like to see if I can get it to stay
there :) Post-processing also prohibits me from faceting, which would be
extremely useful.

I'm trying to avoid heavy post-processing if I can. Given the nature of
the data, it's not really feasible for me to pre-assemble instance data and
index it, since I don't know the window of time a user will be looking at.

Thanks,

Alejandro


On Wed, Oct 30, 2013 at 6:35 PM, Upayavira u...@odoko.co.uk wrote:

 Also note that function queries only return numbers (given their origin
 in scoring). They cannot be used to create virtual string or text
 fields.

 Upayavira

 On Wed, Oct 30, 2013, at 05:19 PM, Jack Krupansky wrote:
  A function query is simply returning a calculated result based on
  existing
  data - no new fields required.
 
  Did you actually want to precompute a value, store it in the index, and
  then
  query on it? If so, you could do that indexing with a custom or scripted
  update processor.
 
  Flesh out an example of exactly what you want.
 
  -- Jack Krupansky
 
  -Original Message-
  From: Alejandro Calbazana
  Sent: Wednesday, October 30, 2013 12:46 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Computing Results So That They are Returned in Search
  Results
 
  Sounds really close to what I'm looking for, but this sounds like it
  would
  result in a new field on a document (or a new value for a field defined
  to
  hold the result of a function).  Would it be possible for a function
  query
  to produce a new document so that I can associate the computed value with
  it?
 
  Thanks,
 
  Alejandro
 
 
  On Wed, Oct 30, 2013 at 12:05 PM, Jack Krupansky
  j...@basetechnology.comwrote:
 
   You could create a custom value source and then use it in a function
   query embedded in your return fields list (fl).
  
   So, the function query could use a function (value source) that takes a
   field, fetches its value, performs some arbitrary calculation, and then
   returns that value.
  
   fl=id,name,my-func(field1),my-func(field2)
  
   -- Jack Krupansky
  
   -Original Message- From: Alejandro Calbazana
   Sent: Wednesday, October 30, 2013 10:10 AM
   To: solr-user@lucene.apache.org
   Subject: Computing Results So That They are Returned in Search Results
  
   I'd like to throw out a design question and see if it's possible to
 solve
   this with Solr.
  
   I have a set of data that is computed that I'd like to make searchable.
   Ideally, I'd like to have all documents indexed and call it a day,
 but
   the nature of the data is such that it needs to be computed given a
   definition.  I'm interested in searching on definitions and then
 creating
   results on the fly that are calculated based on something embedded in
 the
   definition.
  
   Is it possible to embed this calculation logic into Solr's result
 handling
   process?  I know this sounds exotic, but the nature of the data is such
   that I can't index these calculated documents because I don't know what
   the
   boundary is and specifying an arbitrary number isn't ideal.
  
   Has anyone run across something like this?
  
   Thanks,
  
   Alejandro
  
 



How to get similarity score between 0 and 1 not relative score

2013-10-30 Thread sushil sharma
Hi,
 
We have a requirement where the user would like to see a score (between 0 and 1)
which can tell how close the input search string is to the result string. So if
the input was very close but not an exact match, the score could be 0.90, etc.
 
I do understand that we can get the score from Solr & divide by the highest
score, but that will always show 1 even if the match was not exact.
 
Regards,
Susheel

Re: How to get similarity score between 0 and 1 not relative score

2013-10-30 Thread Anshum Gupta
Hi Susheel,

Have a look at this: http://wiki.apache.org/lucene-java/ScoresAsPercentages

You may really want to reconsider doing that.




On Thu, Oct 31, 2013 at 9:41 AM, sushil sharma sushil2...@yahoo.co.inwrote:

 Hi,

 We have a requirement where user would like to see a score (between 0 to
 1) which can tell how close the input search string is with result string.
 So if input was very close but not exact match, score could be .90 etc.

 I do understand that we can get score from Solr & divide by highest score
 but that will always show 1 even if we match was not exact.

 Regards,
 Susheel




-- 

Anshum Gupta
http://www.anshumgupta.net


Solr grouping performance problem

2013-10-30 Thread Shamik Bandopadhyay
Hi,

   I've recently upgraded to SolrCloud (4.4) from Master-Slave mode. One of
the changes I made in the queries is to add grouping to remove duplicate
results. The grouping is done on a specific field, but the change seems to
have had a huge effect on query performance: the group option decreased
performance by 10 times. For example, this query takes 1 sec to execute.
The number of results is around 105387.

http://localhost:8083/solr/browse?fq=language:(english)&wt=xml&rows=10&start=0&fq=(ContentGroup-local:Learn Explore OR ADSKContentGroup-local:Getting Started)&q=line&sort=score desc&group=true&group.field=dedup&group.ngroups=true

If I exclude the group option, it comes down to 190ms:

http://localhost:8083/solr/browse?fq=language:(english)&wt=xml&rows=10&start=0&fq=(ContentGroup-local:Learn Explore OR ADSKContentGroup-local:Getting Started)&q=line

I'm running this query against an 8 million doc index. I've got 2 shards with 1
replica each, running on m1.xlarge EC2 instances, each having 8GB of allocated
memory.

Is this a known issue, or am I missing something that is making this query
expensive?

I bumped into this JIRA --
https://issues.apache.org/jira/browse/SOLR-5027 -- which
talks about CollapsingQParserPlugin as an alternative to grouping, but that
only seems to be available in 4.6. Just wondering if it could be an alternative
in my case, and whether it's possible to apply it as a patch on 4.4.
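
If it did apply cleanly, I guess the grouped query above would turn into a
collapse filter query, roughly like this (untested, and as far as I can tell
it doesn't return an ngroups equivalent directly):

http://localhost:8083/solr/browse?fq=language:(english)&wt=xml&rows=10&start=0&fq=(ContentGroup-local:Learn Explore OR ADSKContentGroup-local:Getting Started)&fq={!collapse field=dedup}&q=line&sort=score desc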

Any pointer will be appreciated.

- Thanks,
Shamik


Re: Grouping performance problem

2013-10-30 Thread shamik
Bumping this thread as I'm facing a similar issue. Any solution?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Grouping-performance-problem-tp3995245p4098566.html
Sent from the Solr - User mailing list archive at Nabble.com.