Re: SOLR 1.2 - Duplicate Documents??
Schema.xml: <field name="id" type="string" indexed="true" stored="true"/> Have you edited schema.xml since building a full index from scratch? If so, try rebuilding the index. People often get the behavior you describe if the 'id' is a 'text' field. ryan
Re: SOLR 1.2 - Duplicate Documents??
: Hey all, I have a fairly odd case of duplicate documents in our solr index : (see attached xml sample). The index is roughly 35k documents. The only ... How did you index those documents? Any chance you inadvertently set the allowDups=true attribute when sending them to Solr (possibly because of an option whose meaning you didn't fully understand in solrj or solr-ruby etc...)? -Hoss
MultiCore unregister
For the MultiCore experts, is there an acceptable or approved way to close and unregister a single SolrCore? I'm interested in stopping cores, manipulating the solr directory tree, and reregistering them. Thanks, -John R.
Search Multiple indexes In Solr
Hi, I'm new to Solr but very familiar with Lucene. Is there a way to have Solr search in more than one index, much like the MultiSearcher in Lucene? If so, how do I configure the location of the indexes?
Re: SOLR 1.2 - Duplicate Documents??
I haven't made any changes to the schema since the initial full-index. Do you know if there is a way to rebuild the full index in the background, without having to take down the current live index? Dan ryantxu wrote: Schema.xml: <field name="id" type="string" indexed="true" stored="true"/> Have you edited schema.xml since building a full index from scratch? If so, try rebuilding the index. People often get the behavior you describe if the 'id' is a 'text' field. ryan -- View this message in context: http://www.nabble.com/SOLR-1.2---Duplicate-Documents---tf4762687.html#a13629639 Sent from the Solr - User mailing list archive at Nabble.com.
Re: start.jar -Djetty.port= not working
Hi Brian, Found the SVN location, will download from there and give it a try. Thanks for the help. On 07/11/2007, Mike Davies [EMAIL PROTECTED] wrote: I'm using 1.2, downloaded from http://apache.rediris.es/lucene/solr/ Where can I get the trunk version? On 07/11/2007, Brian Whitman [EMAIL PROTECTED] wrote: On Nov 7, 2007, at 10:00 AM, Mike Davies wrote: java -Djetty.port=8521 -jar start.jar However when I run this it seems to ignore the command and still start on the default port of 8983. Any suggestions? Are you using trunk solr or 1.2? I believe 1.2 still shipped with an older version of jetty that doesn't follow the new-style CL arguments. I just tried it on trunk and it worked fine for me. -- http://variogr.am/ [EMAIL PROTECTED]
Re: start.jar -Djetty.port= not working
On Nov 7, 2007, at 10:00 AM, Mike Davies wrote: java -Djetty.port=8521 -jar start.jar However when I run this it seems to ignore the command and still start on the default port of 8983. Any suggestions? Are you using trunk solr or 1.2? I believe 1.2 still shipped with an older version of jetty that doesn't follow the new-style CL arguments. I just tried it on trunk and it worked fine for me. -- http://variogr.am/ [EMAIL PROTECTED]
Re: Can you parse the contents of a field to populate other fields?
On 11/6/07, Kristen Roth [EMAIL PROTECTED] wrote: Yonik - thanks so much for your help! Just to clarify; where should the regex go for each field? Each field should have a different FieldType (referenced by the type XML attribute). Each fieldType can have its own analyzer. You can use a different PatternTokenizer (which specifies a regex) for each analyzer. -Yonik
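Not from the thread itself, but a hedged sketch of what Yonik describes (the fieldType names and the second regex are made up for illustration): two fieldTypes in schema.xml, each with its own PatternTokenizerFactory pattern.

```xml
<!-- Illustrative only: one fieldType per facet position -->
<fieldType name="cat1" class="solr.TextField">
  <analyzer>
    <!-- capture everything before the first "::" -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="^([^:]+)" group="1"/>
  </analyzer>
</fieldType>
<fieldType name="cat2" class="solr.TextField">
  <analyzer>
    <!-- capture the segment after the first "::" -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="^[^:]+::([^:]+)" group="1"/>
  </analyzer>
</fieldType>
```

Each Category_# field would then reference its own type via the type attribute.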
Re: Sorting problem
Does anyone know what could be the problem? looks like it was a problem in the new query parser. I just fixed it in trunk: http://svn.apache.org/viewvc?view=rev&revision=592740 Yonik - do we want to keep this checking for 'null', or should we change QueryParser.parseSort( ) to always return a valid sortSpec? ryan
Re: Sorting problem
Yonik Seeley wrote: On 11/7/07, Ryan McKinley [EMAIL PROTECTED] wrote: Yonik - do we want to keep this checking for 'null', or should we change QueryParser.parseSort( ) to always return a valid sortSpec? In Lucene, a null sort is not equal to score desc... they result in the same documents being returned, but the former takes a different code path and is faster. right, but solr QueryParsing.SortSpec holds a lucene Sort -- in either case the lucene Sort object is null. Since num & offset were added to SortSpec, it can't be null anymore (I don't think) ryan
Re: highlight and wildcards ?
I fixed this problem by returning this: return super.getPrefixQuery(field, termStr); in solr.search.SolrQueryParser and it worked for me. -Kamran Mike Klaas wrote: On 7-Jun-07, at 5:27 PM, Frédéric Glorieux wrote: Hoss, Thanks for all your information and pointers. I know that my problems are not mainstream. Have you tried commenting out getPrefixQuery in solr.search.SolrQueryParser? It should then revert to a regular lucene prefix query. -Mike
Simple sorting questions
Pardon the basicness of these questions, but I'm just getting started with SOLR and have a couple of confusions regarding sorting that I couldn't resolve based on the docs or an archive search. 1. There appear to be (at least) two ways to specify sorting, one involving an append to the q parm and the other using the sort parm. Are these exactly equivalent? http://localhost/solr/select/?q=martha;author+asc http://localhost/solr/select/?q=martha&sort=author+asc 2. The docs say that sorting can only be applied to non-multivalued fields. Does this mean that sorting won't work *at all* for multi-valued fields, or only that the behaviour is indeterminate? Based on a brief test, sorting a multi-valued field appeared to work by picking an arbitrary value when multiple values are present and using that for the sort. I wanted to confirm that the expected behaviour is indeed to sort on something (with no guarantees as to what), as opposed to, say, dropping the record, putting the record with multi-values at the end with the missing-value records, or something else entirely. Thanks! Ron
RE: Can you parse the contents of a field to populate other fields?
So, I think I have things set up correctly in my schema, but it doesn't appear that any logic is being applied to my Category_# fields - they are being populated with the full string copied from the Category field (facet1::facet2::facet3...facetn) instead of just facet1, facet2, etc. I have several different field types, each with a different regex to match a specific part of the input string. In this example, I'm matching facet1 in input string facet1::facet2::facet3...facetn: <fieldtype name="cat1str" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.PatternTokenizerFactory" pattern="^([^:]+)" group="1"/> </analyzer> </fieldtype> I have copyFields set up for each Category_# field. Anything obviously wrong? Thanks! Kristen -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Wednesday, November 07, 2007 9:38 AM To: solr-user@lucene.apache.org Subject: Re: Can you parse the contents of a field to populate other fields? On 11/6/07, Kristen Roth [EMAIL PROTECTED] wrote: Yonik - thanks so much for your help! Just to clarify; where should the regex go for each field? Each field should have a different FieldType (referenced by the type XML attribute). Each fieldType can have its own analyzer. You can use a different PatternTokenizer (which specifies a regex) for each analyzer. -Yonik
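As a sanity check outside Solr (a hypothetical sketch; first_facet is not a Solr API), the pattern ^([^:]+) with group 1 does extract only facet1 from the delimited string, so the regex itself looks right:

```python
import re

# Mimic what PatternTokenizerFactory with pattern="^([^:]+)" and
# group="1" should emit for a "::"-delimited category string.
def first_facet(value):
    m = re.match(r"^([^:]+)", value)
    return m.group(1) if m else None

print(first_facet("facet1::facet2::facet3"))  # facet1
```

If the regex checks out in isolation, one thing worth ruling out (an assumption about the symptom, not a confirmed diagnosis): the analyzer only affects the indexed tokens, while the stored value that comes back in results is the raw copied string, so looking at the stored field would show the full facet1::facet2::... string even when tokenization is working.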
Re: how to use PHP AND PHPS?
On Nov 7, 2007, at 2:04 AM, James liu wrote: i just decreased the answer information...and you will see my result (full, not part) *before unserialize* string(433) a:2:{s:14:responseHeader;a:3:{s:6:status;i:0;s:5:QTime;i: 0;s:6:params;a:7:{s:2:fl;s:5:Title;s:6:indent;s:2:on;s: 5:start;s:1:0;s:1:q;s:1:2;s:2:wt;s:4:phps;s:4:rows;a: 2:{i:0;s:1:2;i:1;s:2:10;}s:7:version;s:3: 2.2;}}s:8:response;a:3:{s:8:numFound;i:28;s:5:start;i:0;s: 4:docs;a:2:{i:0;a:1:{s:5:Title;d:诺基亚N-Gage基本数据;}i:1;a:1: {s:5:Title;d:索尼爱立信P908基本数据; *after unserialize...* bool(false) and i wrote serialize test code.. <?php $ar = array( array('id' => 123, 'Title' => '中文测试'), array('id' => 123, 'Title' => '中国上海'), ); echo serialize($ar); ?> and the result is: a:2:{i:0;a:2:{s:2:id;i:123;s:5:Title;s:12:中文测试;}i:1;a:2: {s:2:id;i:123;s:5:Title;s:12:中国上海;}} The *php* result is: string(369) array( 'responseHeader'=>array( 'status'=>0, 'QTime'=>0, 'params'=>array( 'fl'=>'Title', 'indent'=>'on', 'start'=>'0', 'q'=>'2', 'wt'=>'php', 'rows'=>array('2', '10'), 'version'=>'2.2')), 'response'=>array('numFound'=>28,'start'=>0,'docs'=>array( array( 'Title'=>诺基亚N-Gage基本数据), array( 'Title'=>索尼爱立信 P908基本数据)) )) it is a string, so i can't read it correctly from php. This part (after string(369)) is exactly what you should be seeing if you use the php handler, and it's what you get after you unserialize when using phps. You can access your search results as: $solrResults['response']['docs']; In your example above, that would be: array( array('Title'=>诺基亚N-Gage基本数据), array( 'Title'=>索尼爱立信 P908基本数据)) When using the php handler, you must do something like this: eval('$solrResults = ' . $serializedSolrResults . ';'); Then, as above, you can access $solrResults['response']['docs']. To sum up, if you use phps, you must unserialize the results. If you use php, you must eval the results (including some sugar to get a variable set to that value). dave
RE: Analysis / Query problem
Thanks Erik. That helps. -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 07, 2007 11:36 AM To: solr-user@lucene.apache.org Subject: Re: Analysis / Query problem On Nov 7, 2007, at 10:26 AM, Wagner,Harry wrote: I have the following custom field defined for author names. After indexing the 2 documents below, the admin analysis tool looks right for field-name=au and field-value=Schröder, Jürgen. The highlight matching also seems right. However, if I search for au:Schröder, Jürgen using the admin tool I do not get any hits (see below). This appears to be the case whenever there are 2 non-ascii characters in the author name. Searching for au:Schröder, Jurgen finds both of these records. Any idea what is causing this? <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">0</int> <lst name="params"> <str name="indent">on</str> <str name="start">0</str> <str name="q">au:Schröder, Jürgen</str> One thing to note is that the query au:Schröder, Jürgen is being translated (try debugQuery=true to see) to: au:schröder AND/OR defaultField:jürgen -- AND/OR depends on how you have things configured, as well as the default field. You probably want to use the ISOLatin1AccentFilterFactory to have the diacritics flattened to the ASCII characters they look like. Erik
Re: Analysis / Query problem
On Nov 7, 2007, at 10:26 AM, Wagner,Harry wrote: I have the following custom field defined for author names. After indexing the 2 documents below, the admin analysis tool looks right for field-name=au and field-value=Schröder, Jürgen. The highlight matching also seems right. However, if I search for au:Schröder, Jürgen using the admin tool I do not get any hits (see below). This appears to be the case whenever there are 2 non-ascii characters in the author name. Searching for au:Schröder, Jurgen finds both of these records. Any idea what is causing this? <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">0</int> <lst name="params"> <str name="indent">on</str> <str name="start">0</str> <str name="q">au:Schröder, Jürgen</str> One thing to note is that the query au:Schröder, Jürgen is being translated (try debugQuery=true to see) to: au:schröder AND/OR defaultField:jürgen -- AND/OR depends on how you have things configured, as well as the default field. You probably want to use the ISOLatin1AccentFilterFactory to have the diacritics flattened to the ASCII characters they look like. Erik
Re: start.jar -Djetty.port= not working
On Nov 7, 2007, at 10:07 AM, Mike Davies wrote: I'm using 1.2, downloaded from http://apache.rediris.es/lucene/solr/ Where can I get the trunk version? svn, or http://people.apache.org/builds/lucene/solr/nightly/
restricting search to a set of documents
I need to perform a search against a limited set of documents. I have the set of document ids, but was wondering what is the best way to formulate the query to SOLR?
Re: restricting search to a set of documents
On 7-Nov-07, at 2:27 PM, briand wrote: I need to perform a search against a limited set of documents. I have the set of document ids, but was wondering what is the best way to formulate the query to SOLR? add fq=docId:(id1 id2 id3 id4 id5...) cheers, -Mike
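A rough sketch of building that filter query from a list of ids in client code (build_fq is a hypothetical helper, and docId is assumed to be the name of the id field):

```python
from urllib.parse import quote_plus

def build_fq(field, ids):
    # OR the ids together inside one filter-query clause, URL-encoded
    return "fq=" + quote_plus("%s:(%s)" % (field, " ".join(ids)))

print(build_fq("docId", ["id1", "id2", "id3"]))
# fq=docId%3A%28id1+id2+id3%29
```

The resulting parameter is appended to the select URL; because it is an fq rather than part of q, it restricts the result set without affecting relevance scoring.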
unsubscribe
Jeryl Cook /^\ Pharaoh /^\ http://pharaohofkush.blogspot.com/ ..Act your age, and not your shoe size.. -Prince(1986) From: [EMAIL PROTECTED] Subject: Re: start.jar -Djetty.port= not working Date: Wed, 7 Nov 2007 10:13:22 -0500 To: solr-user@lucene.apache.org On Nov 7, 2007, at 10:07 AM, Mike Davies wrote: I'm using 1.2, downloaded from http://apache.rediris.es/lucene/solr/ Where can i get the trunk version? svn, or http://people.apache.org/builds/lucene/solr/nightly/
Re: What is the best way to index xml data preserving the mark up?
If you really, really need to preserve the XML structure, you'll be doing a LOT of work to make Solr do that. It might be cheaper to start with software that already does that. I recommend MarkLogic -- I know the principals there, and it is some seriously fine software. Not free or open, but very, very good. If your problem can be expressed in a flat field model, then your problem is mapping your document model into Solr. You might be able to use structured field names to represent the XML context, but that is just a guess. With a mixed corpus of XML and arbitrary text, requiring special handling of XML, yow, that's a lot of work. One thought -- you can do flat fields in an XML engine (like MarkLogic) much more easily than you can do XML in a flat field engine (like Lucene). wunder On 11/7/07 8:18 PM, David Neubert [EMAIL PROTECTED] wrote: I am sure this is a 101 question, but I am a bit confused about indexing xml data using SOLR. I have rich xml content (books) that needs to be searched at granular levels (specifically paragraph and sentence levels, very accurately, no approximations). My source text has exact <p></p> and <s></s> tags for this purpose. I have built this app in previous versions (using other search engines) indexing the text twice, (1) where every paragraph was a virtual document and (2) where every sentence was a virtual document -- both extracted from the source file (which was a single xml file for the entire book). I have of course thought about using an XML engine like eXist or Xindice, but I prefer the stability, user base and performance that Lucene/SOLR seems to have, and also there is a large body of text that is regular documents and not well-formed XML.
I am brand new to SOLR (one day) and at a basic level understand SOLR's nice simple xml scheme to add documents: <add> <doc> <field name="foo1">foo value 1</field> <field name="foo2">foo value 2</field> </doc> <doc>...</doc> </add> But my problem is that I believe I need to preserve the xml markup at the paragraph and sentence levels, so I was hoping to create a content field that could just contain the source xml for the paragraph or sentence respectively. There are reasons for this that I won't go into -- a lot of granular work in this app, accessing pars and sens. Obviously an XML mechanism that could leverage the xml structure (via XPath or XPointers) would work great. Still, I think Lucene can do this in a field-level way -- and I also can't imagine that users who are indexing XML documents have to go through the trouble of stripping all the markup before indexing? Hopefully I'm missing something basic? It would be great to be pointed in the right direction on this matter. I think I need something along this line: <add> <doc> <field name="foo1">value 1</field> <field name="foo2">value 2</field> <field name="content">an xml stream with embedded source markup</field> </doc> </add> Maybe the overall question is: what is the best way to index XML content using SOLR -- is all this tag stripping really necessary? Thanks for any help, Dave __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
What is the best way to index xml data preserving the mark up?
I am sure this is a 101 question, but I am a bit confused about indexing xml data using SOLR. I have rich xml content (books) that needs to be searched at granular levels (specifically paragraph and sentence levels, very accurately, no approximations). My source text has exact <p></p> and <s></s> tags for this purpose. I have built this app in previous versions (using other search engines) indexing the text twice, (1) where every paragraph was a virtual document and (2) where every sentence was a virtual document -- both extracted from the source file (which was a single xml file for the entire book). I have of course thought about using an XML engine like eXist or Xindice, but I prefer the stability, user base and performance that Lucene/SOLR seems to have, and also there is a large body of text that is regular documents and not well-formed XML. I am brand new to SOLR (one day) and at a basic level understand SOLR's nice simple xml scheme to add documents: <add> <doc> <field name="foo1">foo value 1</field> <field name="foo2">foo value 2</field> </doc> <doc>...</doc> </add> But my problem is that I believe I need to preserve the xml markup at the paragraph and sentence levels, so I was hoping to create a content field that could just contain the source xml for the paragraph or sentence respectively. There are reasons for this that I won't go into -- a lot of granular work in this app, accessing pars and sens. Obviously an XML mechanism that could leverage the xml structure (via XPath or XPointers) would work great. Still, I think Lucene can do this in a field-level way -- and I also can't imagine that users who are indexing XML documents have to go through the trouble of stripping all the markup before indexing? Hopefully I'm missing something basic? It would be great to be pointed in the right direction on this matter.
I think I need something along this line: <add> <doc> <field name="foo1">value 1</field> <field name="foo2">value 2</field> <field name="content">an xml stream with embedded source markup</field> </doc> </add> Maybe the overall question is: what is the best way to index XML content using SOLR -- is all this tag stripping really necessary? Thanks for any help, Dave
Re: how to use PHP AND PHPS?
Hmm, I found the error... it is my error, not about php and phps. I used an old config to test, so the config had a problem: for Title I used double as its type... it should use text. On Nov 8, 2007 10:29 AM, James liu [EMAIL PROTECTED] wrote: php now is ok.. but phps failed. My code: <?php $url = 'http://localhost:8080/solr1/select/?q=2&version=2.2&rows=2&fl=Title&start=0&rows=10&indent=on&wt=phps'; $a = file_get_contents($url); //eval('$solrResults = ' . $serializedSolrResults . ';'); echo '<b>before unserialize</b><br/>'; var_dump($a); echo '<br/>'; $a = unserialize($a); echo '<b>after unserialize...</b><br/>'; var_dump($a); ?> and the result: *before unserialize* string(434) a:2:{s:14:responseHeader;a:3:{s:6:status;i:0;s:5:QTime;i:32;s:6:params;a:7:{s:2:fl;s:5:Title;s:6:indent;s:2:on;s:5:start;s:1:0;s:1:q;s:1:2;s:2:wt;s:4:phps;s:4:rows;a:2:{i:0;s:1:2;i:1;s:2:10;}s:7:version;s:3: 2.2;}}s:8:response;a:3:{s:8:numFound;i:28;s:5:start;i:0;s:4:docs;a:2:{i:0;a:1:{s:5:Title;d:诺基亚N-Gage基本数据;}i:1;a:1:{s:5:Title;d:索尼爱立信P908基本数据; *after unserialize...* bool(false) On Nov 7, 2007 9:30 PM, Dave Lewis [EMAIL PROTECTED] wrote: On Nov 7, 2007, at 2:04 AM, James liu wrote: i just decreased the answer information...and you will see my result (full, not part) *before unserialize* string(433) a:2:{s:14:responseHeader;a:3:{s:6:status;i:0;s:5:QTime;i: 0;s:6:params;a:7:{s:2:fl;s:5:Title;s:6:indent;s:2:on;s: 5:start;s:1:0;s:1:q;s:1:2;s:2:wt;s:4:phps;s:4:rows;a: 2:{i:0;s:1:2;i:1;s:2:10;}s:7:version;s:3: 2.2;}}s:8:response;a:3:{s:8:numFound;i:28;s:5:start;i:0;s: 4:docs;a:2:{i:0;a:1:{s:5:Title;d:诺基亚N-Gage基本数据;}i:1;a:1: {s:5:Title;d:索尼爱立信P908基本数据; *after unserialize...* bool(false) and i wrote serialize test code.. <?php $ar = array( array('id' => 123, 'Title' => '中文测试'), array('id' => 123, 'Title' => '中国上海'), ); echo serialize($ar); ?>
and the result is: a:2:{i:0;a:2:{s:2:id;i:123;s:5:Title;s:12:中文测试;}i:1;a:2: {s:2:id;i:123;s:5:Title;s:12:中国上海;}} The *php* result is: string(369) array( 'responseHeader'=>array( 'status'=>0, 'QTime'=>0, 'params'=>array( 'fl'=>'Title', 'indent'=>'on', 'start'=>'0', 'q'=>'2', 'wt'=>'php', 'rows'=>array('2', '10'), 'version'=>'2.2')), 'response'=>array('numFound'=>28,'start'=>0,'docs'=>array( array( 'Title'=>诺基亚N-Gage基本数据), array( 'Title'=>索尼爱立信P908基本数 据)) )) it is a string, so i can't read it correctly from php. This part (after string(369)) is exactly what you should be seeing if you use the php handler, and it's what you get after you unserialize when using phps. You can access your search results as: $solrResults['response']['docs']; In your example above, that would be: array( array('Title'=>诺基亚N-Gage基本数据), array( 'Title'=>索尼爱立信 P908基本数据)) When using the php handler, you must do something like this: eval('$solrResults = ' . $serializedSolrResults . ';'); Then, as above, you can access $solrResults['response']['docs']. To sum up, if you use phps, you must unserialize the results. If you use php, you must eval the results (including some sugar to get a variable set to that value). dave -- regards jl -- regards jl
Re: how to use PHP AND PHPS?
php now is ok.. but phps failed. My code: <?php $url = 'http://localhost:8080/solr1/select/?q=2&version=2.2&rows=2&fl=Title&start=0&rows=10&indent=on&wt=phps'; $a = file_get_contents($url); //eval('$solrResults = ' . $serializedSolrResults . ';'); echo '<b>before unserialize</b><br/>'; var_dump($a); echo '<br/>'; $a = unserialize($a); echo '<b>after unserialize...</b><br/>'; var_dump($a); ?> and the result: *before unserialize* string(434) a:2:{s:14:responseHeader;a:3:{s:6:status;i:0;s:5:QTime;i:32;s:6:params;a:7:{s:2:fl;s:5:Title;s:6:indent;s:2:on;s:5:start;s:1:0;s:1:q;s:1:2;s:2:wt;s:4:phps;s:4:rows;a:2:{i:0;s:1:2;i:1;s:2:10;}s:7:version;s:3: 2.2;}}s:8:response;a:3:{s:8:numFound;i:28;s:5:start;i:0;s:4:docs;a:2:{i:0;a:1:{s:5:Title;d:诺基亚N-Gage基本数据;}i:1;a:1:{s:5:Title;d:索尼爱立信P908基本数据; *after unserialize...* bool(false) On Nov 7, 2007 9:30 PM, Dave Lewis [EMAIL PROTECTED] wrote: On Nov 7, 2007, at 2:04 AM, James liu wrote: i just decreased the answer information...and you will see my result (full, not part) *before unserialize* string(433) a:2:{s:14:responseHeader;a:3:{s:6:status;i:0;s:5:QTime;i: 0;s:6:params;a:7:{s:2:fl;s:5:Title;s:6:indent;s:2:on;s: 5:start;s:1:0;s:1:q;s:1:2;s:2:wt;s:4:phps;s:4:rows;a: 2:{i:0;s:1:2;i:1;s:2:10;}s:7:version;s:3: 2.2;}}s:8:response;a:3:{s:8:numFound;i:28;s:5:start;i:0;s: 4:docs;a:2:{i:0;a:1:{s:5:Title;d:诺基亚N-Gage基本数据;}i:1;a:1: {s:5:Title;d:索尼爱立信P908基本数据; *after unserialize...* bool(false) and i wrote serialize test code.. <?php $ar = array( array('id' => 123, 'Title' => '中文测试'), array('id' => 123, 'Title' => '中国上海'), ); echo serialize($ar); ?>
and the result is: a:2:{i:0;a:2:{s:2:id;i:123;s:5:Title;s:12:中文测试;}i:1;a:2: {s:2:id;i:123;s:5:Title;s:12:中国上海;}} The *php* result is: string(369) array( 'responseHeader'=>array( 'status'=>0, 'QTime'=>0, 'params'=>array( 'fl'=>'Title', 'indent'=>'on', 'start'=>'0', 'q'=>'2', 'wt'=>'php', 'rows'=>array('2', '10'), 'version'=>'2.2')), 'response'=>array('numFound'=>28,'start'=>0,'docs'=>array( array( 'Title'=>诺基亚N-Gage基本数据), array( 'Title'=>索尼爱立信P908基本数 据)) )) it is a string, so i can't read it correctly from php. This part (after string(369)) is exactly what you should be seeing if you use the php handler, and it's what you get after you unserialize when using phps. You can access your search results as: $solrResults['response']['docs']; In your example above, that would be: array( array('Title'=>诺基亚N-Gage基本数据), array( 'Title'=>索尼爱立信 P908基本数据)) When using the php handler, you must do something like this: eval('$solrResults = ' . $serializedSolrResults . ';'); Then, as above, you can access $solrResults['response']['docs']. To sum up, if you use phps, you must unserialize the results. If you use php, you must eval the results (including some sugar to get a variable set to that value). dave -- regards jl
Re: What is the best way to index xml data preserving the mark up?
On Wed, 7 Nov 2007 20:18:25 -0800 (PST) David Neubert [EMAIL PROTECTED] wrote: I am sure this is a 101 question, but I am a bit confused about indexing xml data using SOLR. I have rich xml content (books) that needs to be searched at granular levels (specifically paragraph and sentence levels, very accurately, no approximations). My source text has exact <p></p> and <s></s> tags for this purpose. I have built this app in previous versions (using other search engines) indexing the text twice, (1) where every paragraph was a virtual document and (2) where every sentence was a virtual document -- both extracted from the source file (which was a single xml file for the entire book). I have of course thought about using an XML engine like eXist or Xindice, but I prefer the stability, user base and performance that Lucene/SOLR seems to have, and also there is a large body of text that is regular documents and not well-formed XML. I am brand new to SOLR (one day) and at a basic level understand SOLR's nice simple xml scheme to add documents: <add> <doc> <field name="foo1">foo value 1</field> <field name="foo2">foo value 2</field> </doc> <doc>...</doc> </add> But my problem is that I believe I need to preserve the xml markup at the paragraph and sentence levels, so I was hoping to create a content field that could just contain the source xml for the paragraph or sentence respectively. There are reasons for this that I won't go into -- a lot of granular work in this app, accessing pars and sens. Obviously an XML mechanism that could leverage the xml structure (via XPath or XPointers) would work great. Still, I think Lucene can do this in a field-level way -- and I also can't imagine that users who are indexing XML documents have to go through the trouble of stripping all the markup before indexing? Hopefully I'm missing something basic? It would be great to be pointed in the right direction on this matter.
I think I need something along this line: <add> <doc> <field name="foo1">value 1</field> <field name="foo2">value 2</field> <field name="content">an xml stream with embedded source markup</field> </doc> </add> Maybe the overall question is: what is the best way to index XML content using SOLR -- is all this tag stripping really necessary? crazy/silly idea maybe... could you use dynamic fields, each containing a sentence, and a reference to the paragraph it belongs to? eg, (not sure if the syntax is correct..) <dynamicField name="s_*" type="string"/> Then when you create your document you can define <doc> <field name="s_1_p1">{Sentence #1, Para#1}</field> <field name="s_2_p1">{Sentence #2, Para#1}</field> <field name="s_3_p1">{Sentence #3, Para#1}</field> <field name="s_1_p2">{Sentence #1, Para#2}</field> [...] </doc> I have no idea how scalable that would be. cheers, B _ {Beto|Norberto|Numard} Meijome Immediate success shouldn't be necessary as a motivation to do the right thing. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Can you parse the contents of a field to populate other fields?
I'm not sure I fully understand your ultimate goal or Yonik's response. However, in the past I've been able to represent hierarchical data as a simple enumeration of delimited paths: <field name="taxonomy">root</field> <field name="taxonomy">root/region</field> <field name="taxonomy">root/region/north america</field> <field name="taxonomy">root/region/south america</field> Then, at response time, you can walk the result facet and build a hierarchy with counts that can be put into a tree view. The tree can be any arbitrary depth, and documents can live in any combination of nodes on the tree. In addition, you can represent any arbitrary name-value pair (attribute/tuple) as a two-level tree. That way, you can put any combination of attributes in the facet and parse them out at results-list time. For example, you might be indexing computer hardware. Memory, Bus Speed and Resolution may be valid for some objects but not for others. Just put them in a facet and specify a separator: <field name="attribute">memory:1GB</field> <field name="attribute">busspeed:133Mhz</field> <field name="attribute">voltage:110/220</field> <field name="attribute">manufacturer:Shiangtsu</field> When you do a facet query, you can easily display the categories appropriate to the object, and do facet selections like show me all green things and show me all size 4 things. Even if that's not your goal, this might help someone else. George Everitt On Nov 7, 2007, at 3:15 PM, Kristen Roth wrote: So, I think I have things set up correctly in my schema, but it doesn't appear that any logic is being applied to my Category_# fields - they are being populated with the full string copied from the Category field (facet1::facet2::facet3...facetn) instead of just facet1, facet2, etc. I have several different field types, each with a different regex to match a specific part of the input string.
In this example, I'm matching facet1 in input string facet1::facet2::facet3...facetn: <fieldtype name="cat1str" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.PatternTokenizerFactory" pattern="^([^:]+)" group="1"/> </analyzer> </fieldtype> I have copyFields set up for each Category_# field. Anything obviously wrong? Thanks! Kristen -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Wednesday, November 07, 2007 9:38 AM To: solr-user@lucene.apache.org Subject: Re: Can you parse the contents of a field to populate other fields? On 11/6/07, Kristen Roth [EMAIL PROTECTED] wrote: Yonik - thanks so much for your help! Just to clarify; where should the regex go for each field? Each field should have a different FieldType (referenced by the type XML attribute). Each fieldType can have its own analyzer. You can use a different PatternTokenizer (which specifies a regex) for each analyzer. -Yonik
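The path enumeration George describes above can be generated mechanically at index time; a minimal sketch (facet_paths is a hypothetical helper, not part of Solr):

```python
def facet_paths(path, sep="/"):
    # "root/region/north america" -> every prefix path, one field value each
    parts = path.split(sep)
    return [sep.join(parts[:i + 1]) for i in range(len(parts))]

print(facet_paths("root/region/north america"))
# ['root', 'root/region', 'root/region/north america']
```

Index each returned string as a separate value of the taxonomy field; at response time, facet counts on those values let you rebuild the tree with per-node counts.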
Timeout in remote streaming
Hi, I'm sending a local csv file to Solr via remote streaming, and constantly get the 500 read timeout message. The csv file is about 200MB in size, and Solr is running on Tomcat 5.5. What timeout-related Tomcat params can I adjust to fix this? Thanks in advance. - Guangwei
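Not an answer from the thread, but one place to look (a hedged sketch): the HTTP Connector's connectionTimeout attribute, in milliseconds, in Tomcat's conf/server.xml is the usual first knob. Whether it governs the whole upload in this case is an assumption worth verifying against the Tomcat 5.5 connector docs; 600000 below is an illustrative value, not a recommendation.

```xml
<!-- conf/server.xml: raise the connector read timeout (milliseconds) -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="600000"/>
```

An alternative that sidesteps the upload entirely: since the csv file is local to the server, stream.file can point Solr at the file path instead of posting 200MB over HTTP.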
Re: What is the best way to index xml data preserving the mark up?
Thanks Walter -- I am aware of MarkLogic -- and agree -- but I have a very low budget for licensed software in this case (near 0) -- have you used eXist or Xindice? Dave - Original Message From: Walter Underwood [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, November 7, 2007 11:37:38 PM Subject: Re: What is the best way to index xml data preserving the mark up? If you really, really need to preserve the XML structure, you'll be doing a LOT of work to make Solr do that. It might be cheaper to start with software that already does that. I recommend MarkLogic -- I know the principals there, and it is some seriously fine software. Not free or open, but very, very good. If your problem can be expressed in a flat field model, then your problem is mapping your document model into Solr. You might be able to use structured field names to represent the XML context, but that is just a guess. With a mixed corpus of XML and arbitrary text, requiring special handling of XML, yow, that's a lot of work. One thought -- you can do flat fields in an XML engine (like MarkLogic) much more easily than you can do XML in a flat field engine (like Lucene). wunder On 11/7/07 8:18 PM, David Neubert [EMAIL PROTECTED] wrote: I am sure this is a 101 question, but I am a bit confused about indexing xml data using SOLR. I have rich xml content (books) that needs to be searched at granular levels (specifically paragraph and sentence levels, very accurately, no approximations). My source text has exact <p></p> and <s></s> tags for this purpose. I have built this app in previous versions (using other search engines) indexing the text twice, (1) where every paragraph was a virtual document and (2) where every sentence was a virtual document -- both extracted from the source file (which was a single xml file for the entire book).
I have of course thought about using an XML engine like eXist or Xindice, but I prefer the stability, user base and performance that Lucene/SOLR seems to have, and also there is a large body of text that is regular documents and not well-formed XML. I am brand new to SOLR (one day) and at a basic level understand SOLR's nice simple xml scheme to add documents: <add> <doc> <field name="foo1">foo value 1</field> <field name="foo2">foo value 2</field> </doc> <doc>...</doc> </add> But my problem is that I believe I need to preserve the xml markup at the paragraph and sentence levels, so I was hoping to create a content field that could just contain the source xml for the paragraph or sentence respectively. There are reasons for this that I won't go into -- a lot of granular work in this app, accessing pars and sens. Obviously an XML mechanism that could leverage the xml structure (via XPath or XPointers) would work great. Still, I think Lucene can do this in a field-level way -- and I also can't imagine that users who are indexing XML documents have to go through the trouble of stripping all the markup before indexing? Hopefully I'm missing something basic? It would be great to be pointed in the right direction on this matter. I think I need something along this line: <add> <doc> <field name="foo1">value 1</field> <field name="foo2">value 2</field> <field name="content">an xml stream with embedded source markup</field> </doc> </add> Maybe the overall question is: what is the best way to index XML content using SOLR -- is all this tag stripping really necessary? Thanks for any help, Dave
Re: MultiCore unregister
I was hoping that a feature was lurking about and not yet added to the patch. How about something like this? Should it throw an exception if the core isn't found in the map? Thanks, -jrr

--- MultiCore.java.orig	2007-11-07 23:09:32.0 -0500
+++ MultiCore.java	2007-11-07 23:14:08.0 -0500
@@ -125,6 +125,25 @@
     }
   }

+  /**
+   * Stop and unregister a core of the given name
+   *
+   * @param name
+   */
+  public void shutdown( String name )
+  {
+    if ( name == null || name.length() == 0 ) {
+      throw new RuntimeException( "Invalid core name." );
+    }
+    synchronized ( cores ) {
+      SolrCore core = cores.get( name );
+      if ( core != null ) {
+        cores.remove( name );
+        core.close();
+      }
+    }
+  }
+
   @Override
   protected void finalize() {
     shutdown();

Ryan McKinley wrote: Nothing yet... but check: https://issues.apache.org/jira/browse/SOLR-350 ryan John Reuning wrote: For the MultiCore experts, is there an acceptable or approved way to close and unregister a single SolrCore? I'm interested in stopping cores, manipulating the solr directory tree, and re-registering them. Thanks, -John R.