Skipping duplicates in DataImportHandler based on uniqueKey

2010-05-02 Thread Andrew Clegg
Hi, Is there a way to get the DataImportHandler to skip already-seen records rather than reindexing them? The UpdateHandler has an <add overwrite="false" ...> capability which (as I understand it) means that a document whose uniqueKey matches one already in the index will be skipped instead of
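
For reference, the overwrite flag mentioned above is an attribute on the XML update message. A minimal Python sketch, assuming a hypothetical document with id and title fields, that only builds such a message (posting it to the update handler is left out):

```python
import xml.etree.ElementTree as ET

def add_message(doc, overwrite=False):
    """Serialise one document as a Solr XML <add> message."""
    add = ET.Element("add", {"overwrite": "true" if overwrite else "false"})
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# Hypothetical document; the field names are placeholders, not from the thread.
message = add_message({"id": "doc-1", "title": "example"})
print(message)
```

Whether the DataImportHandler can be made to send adds this way is exactly the open question in the message; the sketch only shows the shape of the update message itself.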

Re: Skipping duplicates in DataImportHandler based on uniqueKey

2010-05-03 Thread Andrew Clegg
Marc Sturlese wrote: You can use deduplication to do that. Create the signature based on the unique field or any field you want. Cool, thanks, I hadn't thought of that.

ClassNotFoundException: org.apache.solr.response.VelocityResponseWriter

2010-05-09 Thread Andrew Clegg
Hi, I'm trying to get the Velocity / Solritas feature to work for one core of a two-core Solr instance, but it's not playing nice. I know the right jars are being loaded, because I can see them mentioned in the log, but still I get a class not found exception: 09-May-2010 15:34:02

Re: ClassNotFoundException: org.apache.solr.response.VelocityResponseWriter

2010-05-09 Thread Andrew Clegg
Erik Hatcher-4 wrote: What version of Solr? Try switching to class=solr.VelocityResponseWriter, and if that doesn't work use class=org.apache.solr.request.VelocityResponseWriter. The first form is the recommended way to do it. The actual package changed in trunk not too long

Re: ClassNotFoundException: org.apache.solr.response.VelocityResponseWriter

2010-05-09 Thread Andrew Clegg
Sorry -- in the second of those error messages (the NPE) I meant <str name="defType">lucene</str> not standard. Andrew Clegg wrote: Erik Hatcher-4 wrote: What version of Solr? Try switching to class=solr.VelocityResponseWriter, and if that doesn't work use class

Fixed: Solritas on multicore Solr, using standard query handler (was Re: ClassNotFoundException: org.apache.solr.response.VelocityResponseWriter)

2010-05-09 Thread Andrew Clegg
or /solr/itas and insert your core name in the middle. (Does anyone know if there'd be a simple way to make that automatic?) Andrew Clegg wrote: Erik Hatcher-4 wrote: What version of Solr? Try switching to class=solr.VelocityResponseWriter, and if that doesn't work use class

How bad is stopping Solr with SIGKILL?

2010-05-31 Thread Andrew Clegg
Hi folks, I had a Solr instance (in Jetty on Linux) taken down by a process monitoring tool (God) with a SIGKILL recently. How bad is this? Can it cause index corruption if it's in the middle of indexing something? Or will it just lose uncommitted changes? What if the signal arrives in the

Re: Indexing link targets in HTML fragments

2010-06-07 Thread Andrew Clegg
Lance Norskog-2 wrote: The PatternReplace and HTMLStrip tokenizers might be the right bet. The easiest way to go about this is to make a bunch of text fields with different analysis stacks and investigate them in the Schema Browser. You can paste an HTML document into the text box and see

Re: Indexing link targets in HTML fragments

2010-06-07 Thread Andrew Clegg
findbestopensource wrote: Could you tell us your schema used for indexing? In my opinion, using StandardAnalyzer / Snowball analyzer will do the best. They will not break the URLs. Add href and other related HTML tags as stop words and they will be removed while indexing. This

Re: Filtering near-duplicates using TextProfileSignature

2010-06-08 Thread Andrew Clegg
Andrew Clegg wrote: Re. your config, I don't see a minTokenLength in the wiki page for deduplication, is this a recent addition that's not documented yet? Sorry about this -- stupid question -- I should have read back through the thread and refreshed my memory.

Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Andrew Clegg
Markus Jelsma wrote: Well, it got me too! KMail didn't properly order this thread. Can't seem to find Hatcher's reply anywhere. ??!!? Whole thread here: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039.html

Duplicate items in distributed search

2010-07-04 Thread Andrew Clegg
Hi, I'm after a bit of clarification about the 'limitations' section of the distributed search page on the wiki. The first two limitations say: * Documents must have a unique key and the unique key must be stored (stored=true in schema.xml) * When duplicate doc IDs are received, Solr chooses

Re: Duplicate items in distributed search

2010-07-04 Thread Andrew Clegg
Mark Miller-3 wrote: On 7/4/10 12:49 PM, Andrew Clegg wrote: I thought so but thanks for clarifying. Maybe a wording change on the wiki Sounds like a good idea - go ahead and make the change if you'd like. That page seems to be marked immutable...

Re: Using symlinks to alias cores

2010-07-10 Thread Andrew Clegg
Chris Hostetter-3 wrote: a cleaner way to deal with this would be to use something like RewriteRule -- either in your appserver (if it supports a feature like that) or in a proxy sitting in front of Solr. I think we'll go with this -- seems like the most bulletproof way. Cheers,

SolrCloud in production?

2010-07-24 Thread Andrew Clegg
Is anyone using ZooKeeper-based Solr Cloud in production yet? Any war stories? Any problematic missing features? Thanks, Andrew.

maxMergeDocs and performance tuning

2010-08-15 Thread Andrew Clegg
Hi, I'm a little confused about how the tuning params in solrconfig.xml actually work. My index currently has mergeFactor=25 and maxMergeDocs=2147483647. So this means that up to 25 segments can be created before a merge happens, and each segment can have up to 2bn docs in, right? But this

Re: maxMergeDocs and performance tuning

2010-08-17 Thread Andrew Clegg
Okay, thanks Marc. I don't really have any complaints about performance (yet!) but I'm still wondering how the mechanics work, e.g. when you have a number of segments equal to mergeFactor, and each contains maxMergeDocs documents. The docs are a bit fuzzy on this...

Duplicate docs when mergin

2010-08-21 Thread Andrew Clegg
-- View this message in context: http://lucene.472066.n3.nabble.com/Duplicate-docs-when-mergin-tp1261979p1261979.html Sent from the Solr - User mailing list archive at Nabble.com.

Result missing from query, but match shows in Field Analysis tool

2009-10-23 Thread Andrew Clegg
Hi, I have a field in my index called related_ids, indexed and stored, with the following field type: <!-- A text field that tokenizes on whitespace, removing non-word characters at the start and end of each token, but preserving meaningful punctuation *within*

Re: Result missing from query, but match shows in Field Analysis tool

2009-10-23 Thread Andrew Clegg
? That 1cuk is past the 10,000th term in record 2.40? For this to be possible, I have to assume that the FieldAnalysis tool ignores this limit FWIW Erick On Fri, Oct 23, 2009 at 12:01 PM, Andrew Clegg andrew.cl...@gmail.com wrote: Hi, I have a field in my index called related_ids

Solr ignoring maxFieldLength?

2009-10-26 Thread Andrew Clegg
Morning, Last week I was having a problem with terms visible in my search results in large documents not causing query hits: http://www.nabble.com/Result-missing-from-query%2C-but-match-shows-in-Field-Analysis-tool-td26029040.html#a26029351 Erick suggested it might be related to

Re: Solr ignoring maxFieldLength?

2009-10-26 Thread Andrew Clegg
being ignored. -Yonik http://www.lucidimagination.com On Mon, Oct 26, 2009 at 7:11 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Morning, Last week I was having a problem with terms visible in my search results in large documents not causing query hits: http://www.nabble.com

Re: Solr ignoring maxFieldLength?

2009-10-26 Thread Andrew Clegg
Yonik Seeley-2 wrote: Sorry Andrew, this is something that's bitten people before. search for maxFieldLength and you will see *2* of them in your config - one for indexDefaults and one for mainIndex. The one in mainIndex is set at 1 and hence overrides the one in indexDefaults.
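
A quick way to spot this kind of duplicated setting is to list every section that declares it. The Python sketch below runs against a hypothetical, trimmed solrconfig.xml fragment; the section names follow the message above, but the values are illustrative only and are not taken from the thread:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; a real solrconfig.xml has many more elements.
SOLRCONFIG = """
<config>
  <indexDefaults>
    <maxFieldLength>2147483647</maxFieldLength>
  </indexDefaults>
  <mainIndex>
    <maxFieldLength>10000</maxFieldLength>
  </mainIndex>
</config>
"""

def max_field_lengths(xml_text):
    """Map each top-level section that declares maxFieldLength to its value."""
    root = ET.fromstring(xml_text)
    found = {}
    for section in root:
        node = section.find("maxFieldLength")
        if node is not None:
            found[section.tag] = int(node.text)
    return found

settings = max_field_lengths(SOLRCONFIG)
for section, value in settings.items():
    print(section, value)
```

If both sections show up, the mainIndex value is the one that takes effect, as described in the reply.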

Re: Solr ignoring maxFieldLength?

2009-10-26 Thread Andrew Clegg
Yonik Seeley-2 wrote: If you could, it would be great if you could test commenting out the one in mainIndex and see if it inherits correctly from indexDefaults... if so, I can comment it out in the example and remove one other little thing that people could get wrong. Yep, it seems

Re: Greater-than and less-than in data import SQL queries

2009-10-27 Thread Andrew Clegg
which make it ugly but whatcha gonna do? Erik On Oct 27, 2009, at 11:50 AM, Andrew Clegg wrote: Hi, If I have a DataImportHandler query with a greater-than sign in, like this: <entity name="higher_node" dataSource="database" query="select *, title as keywords from

Re: Faceting within one document

2009-10-28 Thread Andrew Clegg
://wiki.apache.org/solr/TermsComponent Helps? Cheers Avlesh On Wed, Oct 28, 2009 at 11:32 PM, Andrew Clegg andrew.cl...@gmail.com wrote: Hi, If I give a query that matches a single document, and facet on a particular field, I get a list of all the terms in that field which appear

dismax and query analysis

2009-10-29 Thread Andrew Clegg
Morning, Can someone clarify how dismax queries work under the hood? I couldn't work this particular point out from the documentation... I get that they pretty much issue the user's query against all of the fields in the schema -- or rather, all of the fields you've specified in the qf

Re: dismax and query analysis

2009-10-29 Thread Andrew Clegg
to that particular field for queries (as opposed to indexing). For example, if test is matched against a string vs text field, different analyzers may be applied to string or text Hope that helps Amit On Thu, Oct 29, 2009 at 4:39 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Morning, Can someone

Re: Faceting within one document

2009-10-29 Thread Andrew Clegg
-value facets. On Wed, Oct 28, 2009 at 11:36 AM, Andrew Clegg andrew.cl...@gmail.com wrote: Isn't the TermVectorComponent more for one document at a time, and the TermsComponent for the whole index? Actually -- having done some digging... What I'm really after is the most informative terms

NullPointerException with TermVectorComponent

2009-11-02 Thread Andrew Clegg
Hi, I've recently added the TermVectorComponent as a separate handler, following the example in the supplied config file, i.e.: <searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/> <requestHandler name="/tvrh"

Highlighting is very slow

2009-11-03 Thread Andrew Clegg
Hi everyone, I'm experimenting with highlighting for the first time, and it seems shockingly slow for some queries. For example, this query: http://server:8080/solr/select/?q=transferase&qt=dismax&version=2.2&start=0&rows=10&indent=on takes 313ms. But when I add highlighting:
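
The first URL in the message, with its parameters spelled out, can be rebuilt as in the Python sketch below. The highlighted variant is an assumption: the second URL is cut off in the archive, so hl=true and a hypothetical field named content stand in for whatever was actually used.

```python
from urllib.parse import urlencode

BASE = "http://server:8080/solr/select/"  # host as written in the message

# Parameters of the plain (non-highlighted) query from the message.
PARAMS = {
    "q": "transferase",
    "qt": "dismax",
    "version": "2.2",
    "start": 0,
    "rows": 10,
    "indent": "on",
}

def select_url(extra=None):
    """Return the select URL, optionally with extra parameters appended."""
    merged = dict(PARAMS)
    merged.update(extra or {})
    return BASE + "?" + urlencode(merged)

plain_url = select_url()
# Assumption: highlighting enabled on a hypothetical "content" field.
highlight_url = select_url({"hl": "true", "hl.fl": "content"})
print(plain_url)
print(highlight_url)
```

Timing the two URLs side by side is one way to isolate how much of the response time is spent in the highlighter.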

Re: Highlighting is very slow

2009-11-04 Thread Andrew Clegg
not with those really long response times). Fixed by moving to JRE 1.6 and tuning garbage collection. Bye, Jaco. 2009/11/3 Andrew Clegg andrew.cl...@gmail.com Hi everyone, I'm experimenting with highlighting for the first time, and it seems shockingly slow for some queries. For example

Re: Highlighting is very slow

2009-11-09 Thread Andrew Clegg
Nicolas Dessaigne wrote: Alternatively, you could use a copyfield with a maxChars limit as your highlighting field. Works well in my case. Thanks for the tip. We did think about doing something similar (only enabling highlighting for certain shorter fields) but we decided that perhaps

Selection of terms for MoreLikeThis

2009-11-10 Thread Andrew Clegg
Hi, If I run a MoreLikeThis query like the following: http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1 one of the hits in the results is "and" (I don't do any stopword removal on this field). However if I

Re: Arguments for Solr implementation at public web site

2009-11-13 Thread Andrew Clegg
Lukáš Vlček wrote: I am looking for good arguments to justify implementing a search for sites which are available on the public internet. There are many sites in the "Powered by Solr" section which are indexed by Google and other search engines but still they decided to invest resources into

Data import problem with child entity from different database

2009-11-13 Thread Andrew Clegg
Morning all, I'm having problems with joining a child entity from one database to a parent from another... My entity definitions look like this (names changed for brevity): <entity name="parent" dataSource="db1" query="select a, b, c from parent_table"> <entity name="child" dataSource="db2"

Re: Arguments for Solr implementation at public web site

2009-11-13 Thread Andrew Clegg
Lukáš Vlček wrote: When you need to search for something Lucene or Solr related, which one do you use: - generic Google - go to a particular mail list web site and search from here (if there is any search form at all) Both of these (Nabble in the second case) in case any recent posts

Re: Selection of terms for MoreLikeThis

2009-11-13 Thread Andrew Clegg
Any ideas on this? Is it worth sending a bug report? Those links are live, by the way, in case anyone wants to verify that MLT is returning suggestions with very low tf.idf. Cheers, Andrew. Andrew Clegg wrote: Hi, If I run a MoreLikeThis query like the following: http

Re: Data import problem with child entity from different database

2009-11-13 Thread Andrew Clegg
Noble Paul നോബിള്‍ नोब्ळ्-2 wrote: no obvious issues. you may post your entire data-config.xml Here it is, exactly as last attempt but with usernames etc. removed. Ignore the comments and the unused FileDataSource... http://old.nabble.com/file/p26335171/dataimport.temp.xml

Re: Selection of terms for MoreLikeThis

2009-11-13 Thread Andrew Clegg
Chantal Ackermann wrote: no idea, I'm afraid - but could you send the output of interestingTerms=details? This at least would show what MoreLikeThis uses, in comparison to the TermVectorComponent you've already pasted. I can, but I'm afraid they're not very illuminating!

Re: 'Connection reset' in DataImportHandler Development Console

2009-12-14 Thread Andrew Clegg
aerox7 wrote: Hi Andrew, I downloaded the latest build of Solr (1.4) and I have the same problem with Debug Now in the DataImport dev console. Have you found a solution? Sorry about slow reply, I've been on holiday. No, I never found a solution, it worked in some nightlies but not in others,

Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Andrew Clegg
Hi, I'm interested in near-dupe removal as mentioned (briefly) here: http://wiki.apache.org/solr/Deduplication However the link for TextProfileSignature hasn't been filled in yet. Does anyone have an example of using TextProfileSignature that demonstrates the tunable parameters mentioned in

Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Andrew Clegg
something. Thanks again, Andrew. Erik Hatcher-4 wrote: On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote: I'm interested in near-dupe removal as mentioned (briefly) here: http://wiki.apache.org/solr/Deduplication However the link for TextProfileSignature hasn't been filled in yet. Does

Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Andrew Clegg
Erik Hatcher-4 wrote: On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote: Thanks Erik, but I'm still a little confused as to exactly where in the Solr config I set these parameters. You'd configure them within the processor element, something like this: str name=minTokenLen5

Replication snapshot, tar says file changed as we read it

2011-01-16 Thread Andrew Clegg
(Many apologies if this appears twice, I tried to send it via Nabble first but it seems to have got stuck, and is fairly urgent/serious.) Hi, I'm trying to use the replication handler to take snapshots, then archive them and ship them off-site. Just now I got a message from tar that worried me:

Re: Replication snapshot, tar says file changed as we read it

2011-01-16 Thread Andrew Clegg
:30, Andrew Clegg andrew.cl...@gmail.com wrote: (Many apologies if this appears twice, I tried to send it via Nabble first but it seems to have got stuck, and is fairly urgent/serious.) Hi, I'm trying to use the replication handler to take snapshots, then archive them and ship them off-site

Re: Replication snapshot, tar says file changed as we read it

2011-03-24 Thread Andrew Clegg
Thanks, Andrew. On 16 January 2011 12:55, Andrew Clegg andrew.cl...@gmail.com wrote: PS one other point I didn't mention is that this server has a very fast autocommit limit (2 seconds max time). But I don't know if this is relevant -- I thought the files in the snapshot wouldn't

NullPointerException in DataImportHandler

2009-07-30 Thread Andrew Clegg
First of all, apologies if you get this twice. I posted it by email an hour ago but it hasn't appeared in any of the archives, so I'm worried it's got junked somewhere. I'm trying to use a DataImportHandler to merge some data from a database with some other fields from a collection of XML files,

Re: NullPointerException in DataImportHandler

2009-07-30 Thread Andrew Clegg
Chantal Ackermann wrote: Hi Andrew, your inner entity uses an XML type datasource. The default entity processor is the SQL one, however. For your inner entity, you have to specify the correct entity processor explicitly. You do that by adding the attribute processor, and the value

Re: NullPointerException in DataImportHandler

2009-07-30 Thread Andrew Clegg
Erik Hatcher wrote: On Jul 30, 2009, at 11:54 AM, Andrew Clegg wrote: <entity dataSource="filesystem" name="domain_pdb" url="${domain.pdb_code}-noatom.xml" processor="XPathEntityProcessor" forEach="/"> <field column="content" xpath="//*[local-name()='structCategory']/*[local

Re: NullPointerException in DataImportHandler

2009-07-30 Thread Andrew Clegg
Chantal Ackermann wrote: my experience with XPathEntityProcessor is non-existent. ;-) Don't worry -- your hints put me on the right track :-) I got it working with: <entity dataSource="filesystem" name="domain_pdb" url="${domain.pdb_code}-noatom.xml"

Questions about XPath in data import handler

2009-08-13 Thread Andrew Clegg
A couple of questions about the DIH XPath syntax... The docs say it supports: xpath="/a/b/subject[@qualifier='fullTitle']" xpath="/a/b/subject/@qualifier" xpath="/a/b/c" Does the second one mean select the value of the attribute called qualifier in the /a/b/subject element? e.g. For this

Re: Questions about XPath in data import handler

2009-08-13 Thread Andrew Clegg
Andrew Clegg wrote: subject qualifier=some text / Sorry, Nabble swallowed my XML example. That was supposed to be [a] [b] [subject qualifier=some text /] [/b] [/a] ... but in XML. Andrew.
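
The reading being asked about (the expression /a/b/subject/@qualifier selecting the value of the qualifier attribute) can be illustrated with the example document from this message. This Python sketch uses the standard library's ElementTree as a stand-in; it is not the DataImportHandler's own XPath implementation, and ElementTree has no /@attr syntax, so the attribute is read with get():

```python
import xml.etree.ElementTree as ET

# The example document from the message, written out as XML.
DOC = '<a><b><subject qualifier="some text" /></b></a>'

root = ET.fromstring(DOC)          # root is the <a> element
subject = root.find("b/subject")   # equivalent of /a/b/subject
qualifier = subject.get("qualifier")
print(qualifier)
```

Under that reading, the value extracted for the field would be the attribute text rather than any element content.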

Re: Questions about XPath in data import handler

2009-08-13 Thread Andrew Clegg
Noble Paul നോബിള്‍ नोब्ळ्-2 wrote: On Thu, Aug 13, 2009 at 6:35 PM, Andrew Clegg andrew.cl...@gmail.com wrote: Does the second one mean select the value of the attribute called qualifier in the /a/b/subject element? yes you are right. Isn't that the semantics of standard xpath

Re: Questions about XPath in data import handler

2009-08-14 Thread Andrew Clegg
Noble Paul നോബിള്‍ नोब्ळ्-2 wrote: yes. look at the 'flatten' attribute in the field. It should give you all the text (not attributes) under a given node. I missed that one -- many thanks. Andrew.

'Connection reset' in DataImportHandler Development Console

2009-08-17 Thread Andrew Clegg
Hi folks, I'm trying to use the Debug Now button in the development console to test the effects of some changes in my data import config (see attached). However, each time I click it, the right-hand frame fails to load -- it just gets replaced with the standard 'connection reset' message from

Re: 'Connection reset' in DataImportHandler Development Console

2009-08-17 Thread Andrew Clegg
Noble Paul നോബിള്‍ नोब्ळ्-2 wrote: apparently I do not see any command full-import, delta-import being fired. Is that true? It seems that way -- they're not appearing in the logs. I've tried Debug Now with both full and delta selected from the dropdown, no difference either way. If I

Re: Solr Range Query Anomalities

2009-08-20 Thread Andrew Clegg
Try a sdouble or sfloat field type? Andrew. johan.sjoberg wrote: Hi, we're performing range queries of a field which is of type double. Some queries which should generate results do not, and I think it's best explained by the following examples; it's also expected to exist data in

Re: Wildcard seaches?

2009-08-20 Thread Andrew Clegg
Paul Tomblin wrote: Is there such a thing as a wildcard search? If I have a simple solr.StrField with no analyzer defined, can I query for foo* or foo.* and get everything that starts with foo such as foobar and foobaz? Yes. foo* is fine even on a simple string field. Andrew.

Re: can solr accept other tag other than field?

2009-08-20 Thread Andrew Clegg
You can use the Data Import Handler to pull data out of any XML or SQL data source: http://wiki.apache.org/solr/DataImportHandler Andrew. Elaine Li wrote: Hi, I am a new Solr user. I want to use Solr search to run queries against many XML files I have. I have set up the solr server to

Problem getting Solr home from JNDI in Tomcat

2009-09-29 Thread Andrew Clegg
Hi all, I'm having problems getting Solr to start on Tomcat 6. Tomcat is installed in /opt/apache-tomcat , solr is in /opt/apache-tomcat/webapps/solr , and my Solr home directory is /opt/solr . My config file is in /opt/solr/conf/solrconfig.xml . I have a Solr-specific context file in

Re: Problem getting Solr home from JNDI in Tomcat

2009-09-29 Thread Andrew Clegg
Constantijn Visinescu wrote: This might be a bit of a hack but I got this in the web.xml of my application and it works great. <!-- People who want to hardcode their Solr Home directly into the WAR File can set the JNDI property here... --> <env-entry>

Re: Problem getting Solr home from JNDI in Tomcat

2009-09-30 Thread Andrew Clegg
hossman wrote: : Hi all, I'm having problems getting Solr to start on Tomcat 6. which version of Solr? Sorry -- a nightly build from about a month ago. Re. your other message, I was sure the two machines had the same version on, but maybe not -- when I'm back in the office tomorrow

Re: Problem getting Solr home from JNDI in Tomcat

2009-10-01 Thread Andrew Clegg
Andrew Clegg wrote: hossman wrote: This is why the examples of using context files on the wiki talk about keeping the war *outside* of the webapps directory, and using docBase in your Context declaration... http://wiki.apache.org/solr/SolrTomcat Great, I'll try

Quotes in query string cause NullPointerException

2009-10-01 Thread Andrew Clegg
Hi folks, I'm using the 2009-09-30 build, and any single or double quotes in the query string cause an NPE. Is this normal behaviour? I never tried it with my previous installation. Example: http://myserver:8080/solr/select/?title:%22Creatine+kinase%22 (I've also tried without the URL
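
One thing that stands out in the example URL is that the query is passed without a q= parameter name. The Python sketch below assumes the query was meant to go in q (the reply in this thread is cut off, so that is a guess) and builds the URL with the quotes percent-encoded:

```python
from urllib.parse import urlencode

BASE = "http://myserver:8080/solr/select/"  # host as written in the message

def phrase_query_url(field, phrase):
    """Build a select URL whose q parameter is a quoted phrase on one field."""
    query = '{}:"{}"'.format(field, phrase)
    return BASE + "?" + urlencode({"q": query})

url = phrase_query_url("title", "Creatine kinase")
print(url)
```

Letting a library do the encoding also avoids hand-escaping the quotes, which is easy to get wrong when pasting URLs into a browser.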

Re: Quotes in query string cause NullPointerException

2009-10-01 Thread Andrew Clegg
=... :) Erik On Oct 1, 2009, at 9:49 AM, Andrew Clegg wrote: Hi folks, I'm using the 2009-09-30 build, and any single or double quotes in the query string cause an NPE. Is this normal behaviour? I never tried it with my previous installation. Example: http://myserver:8080/solr/select