Re: How does HTMLStripWhitespaceTokenizerFactory work?
Ok. Thanks for the clarification. We will do the stripping before indexing. On 11/06/07, Chris Hostetter [EMAIL PROTECTED] wrote: : Ok. Is it possible to get back the content without the html tags? Solr never does anything to modify the stored value of a field, so you'd really need to send Solr the value after stripping the HTML to get this to work. Internally, the HTMLStripWhitespaceTokenizerFactory does the HTML stripping as part of the tokenization process, so there is never a single markup-free value for the field in Solr. -Hoss
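Hoss's advice above (strip markup client-side before posting the stored value to Solr) can be sketched with Python's standard-library HTML parser. This is an illustrative sketch only, not Solr's HTMLStripWhitespaceTokenizerFactory; the class and function names here are made up for the example:

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only text content, discarding all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the plain text of an HTML fragment, markup removed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

The stripped string is what you would then put in the field you send to Solr, so the stored value is already markup-free.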
Re: LIUS/Fulltext indexing
On 6/12/07, Yonik Seeley [EMAIL PROTECTED] wrote: ... I think Tika will be the way forward (some of the code for Tika is coming from LIUS)... Work has indeed started to incorporate the Lius code into Tika, see https://issues.apache.org/jira/browse/TIKA-7 and http://incubator.apache.org/projects/tika.html -Bertrand
storing the document URI in the index
Hello, is it possible to configure solr to store the document URI in the lucene index (the URI is not an xml field, but just the document's location)? Or is everybody used to storing the contents of a document in the lucene index (doesn't this imply a much larger index though?), so that instead of retrieving the document's content through a separate fetch over http/filesystem you just show the result from the stored content field? Thx in advance for any help, Regards Ard
Re: storing the document URI in the index
On Jun 12, 2007, at 8:51 AM, Ard Schrijvers wrote: is it possible to configure solr to store the document URI in the lucene index (the URI is not an xml field, but just the document's location)? Yes. Set the field to be stored and non-indexed; field type string is what I use. Or is everybody used to storing the contents of a document in the lucene index (doesn't this imply a much larger index though?), so instead of retrieving the document's content through a separate fetch over http/filesystem just show the result from the stored content field? This all depends on the needs of your project. It's perfectly fine to store the text outside of the index, and that is the way it really has to be done for very large indexes, where as few fields as possible are stored. If you're also asking about Solr fetching the remote resource, that is a different story altogether, and no, it does not do that. [Though with the streaming capability you can feed in a document entirely from a URL, but I haven't experimented with that feature myself yet.] Erik
RE: storing the document URI in the index
Hello Erik, thanks for the fast answer (sry for my mail not indenting, but I must use webmail :-( ). The problem I am facing is that I do not see solr storing the location of the documents it indexed. So, I need to store the location of a document in a field, but I do not see where solr would do this. Fetching the document will be done with the simple cocoon generator, so that is no problem, but of course I need the url/uri to be in the index. I know I need it as an UN_TOKENIZED, STORED field, but I can see with Luke that the location is not present in the lucene index when solr crawls some directory with xml files. Regards Ard Schrijvers
Re: storing the document URI in the index
Ard, You have to store the URI in a Field yourself. That means you need to define that field in the schema and you have to set its value when adding documents. Otis -- Simpy -- http://www.simpy.com/ - Tag - Search - Share
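What Otis describes might look like this in schema.xml and in the add message. The field name uri is just an example, not a required name:

```xml
<!-- schema.xml: a stored, non-indexed string field for the document's location -->
<field name="uri" type="string" indexed="false" stored="true"/>
```

You then set its value yourself in every document you post:

```xml
<add>
  <doc>
    <field name="id">doc-42</field>
    <field name="uri">http://example.com/docs/doc-42.xml</field>
  </doc>
</add>
```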
RE: storing the document URI in the index
Hello Otis, thanks for the info. Would it be an improvement to be able to specify in schema.xml whether or not the URI should be stored, in a field whose name you can also specify in the schema? It may very well be that you do not own the xml documents you index over http, and at the same time you do not want to store their contents in the index. Since the uri is known at indexing time, adding it to the index is trivial. Regards Ard
indexing documents (or pieces of a document) by access controls
Hi all, Can anyone give me some advice on breaking a document up and indexing it by access control lists? What we have are xml documents that are transformed based on the user viewing them. Some users might see all of the document, while others may see a few fields, and yet others see nothing at all. The access control lists may be a role the user belongs to, a list of groups, or even a combination of the two. I can transform the xml to the plain text that I want to index, key it off of the acls, and then pass along a list of acls that the user issuing a query belongs to when searching. But I guess I'm not really sure of the best way to do this. Anyone have any thoughts? Thanks! Nate
Re: storing the document URI in the index
I'm afraid I don't understand your question. Perhaps somebody else does. Otis
Re: storing the document URI in the index
On 6/12/07, Ard Schrijvers [EMAIL PROTECTED] wrote: Would it be an improvement to be able to specify in schema.xml whether or not the URI should be stored? ... Think of it a different way: Solr isn't indexing XML documents, it's simply using XML as a serialization format for the data it is given. Often, a program is written to read some other data source (like a database) and send an XML message to Solr to index it (and hence the XML document only exists for a very brief time). -Yonik
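Yonik's point, that the XML is just a transient serialization of some other data source, could be sketched like this. The helper function is hypothetical, not part of Solr; only the shape of the add message it emits is Solr's real update format:

```python
import xml.etree.ElementTree as ET


def rows_to_add_message(rows):
    """Serialize database rows (dicts) into a Solr <add> message.

    The XML document built here exists only in transit; Solr indexes
    the field values and the XML itself is discarded.
    """
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")
```

The resulting string is what gets POSTed to /solr/update; there is no original "XML document" anywhere whose URI could be stored.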
Re: storing the document URI in the index
Solr doesn't have the URL of the document. The document is given to Solr in an HTTP POST. Solr is not a web spider, it is a search web service. wunder
RE: indexing documents (or pieces of a document) by access controls
Hello Nate, IMHO, you will not be able to do this in solr unless you accept pretty hard constraints on your ACLs (I will get back to this in a moment). IMO, it is not possible to index documents along with ACLs. ACLs can be very fine grained, and the thing you describe, ACL-specific parts of a document... well, I wouldn't know how you would index this. (Imagine you change the ACL for a specific user: how do you know what to re-index and what not? Suppose you add a user? I really do not think it is possible with fine-grained ACLs.) You also should realize you are trying to find an answer to an extremely complex problem: authorisation in an index. (I am trying to develop faceted navigation in combination with authorisation in a lucene index in jackrabbit, but I think this is not the place to discuss it.) So, in your case, if you want to use solr with some form of ACLs, I think you can basically only manage this if: 1) your ACLs are some sort of paths in a hierarchy-based structure, where you index the hierarchical structure along with the content; then when querying you have to include the folders the user is allowed to see, or 2) you keep a bitset for each user of which documents are allowed (but you even have ACLs inside documents). Also, keeping bitsets up to date for many users is almost impossible, because lucene document ids can change after merging segments, and updating documents might mean updating many, many bitsets if you have many users. For these reasons, I do not think you can achieve with solar what you want, unless you are going to work with something like updating the index and ACL bitsets once a day. Regards Ard
RE: storing the document URI in the index
Thanks Yonik and Walter; putting it that way, it does make good sense not to store the transient xml file, which is most of the use cases (I was thinking differently because I do have xml files on the file system or over http, like from a webdav call). Anyway, thx for all the answers, and again, sry for mails not indenting properly at the moment, it irritates me as well :-) Regards Ard
RE: indexing documents (or pieces of a document) by access controls
Excuse me, I meant solr of course :-) For these reasons, I do not think you can achieve with solar
Tomcat: The requested resource (/solr/update) is not available.
Hi, I've got an app using Cocoon and Solr, both running through Tomcat. The post.sh file has been modified to grab local files and send them to Cocoon (via http); the Solr-fied xml from Cocoon is then sent to the update url in Tomcat/Solr. Not sure any of that is relevant though! I'm running the post.sh file like: post.sh ../xml/*.xml which sends all of the files in xml to the post.sh script. Most of the POSTs work fine, but every once in a while I'll get: The requested resource (/solr/update) is not available. So my question is this: is there a problem with sending all of those post requests to solr all at once? Should I be waiting to get an ok response before posting the next? Or is it OK to just blast solr like that? I'm wondering if it's a Tomcat issue? Matt
RE: Multi-language indexing and searching
Daniel, I was reading your email and the responses to it with great interest. I was aware that Solr has an implicit assumption that a field is mono-lingual per system. But your mail and its correspondence made me wonder whether this limitation is practical for multi-lingual search applications. For bi-lingual or tri-lingual search, we can have parallel fields (title_en, title_fr, title_de, for example), but this wouldn't scale well. Assume we are making a search application for a multi-lingual library at a university in Japan, for example: the application would have a book title field in Japanese, perhaps another title field in English for visiting scholars, and a title field in the original language. The last field's language would vary among more than 50 modern languages (and not-so-modern languages like Latin). Solr may need some rearchitecting in this area. I work for a company called Basis Technology (www.basistech.com), which develops a suite of language processing software, and I've written a module to integrate this with Solr (and Lucene in general). The module is made of a universal Tokenizer and Analyzers for English and Japanese, but they can be modified easily to handle any of the 16 languages we can handle. (Source code is provided.) When I was developing this module, I thought of writing a super Analyzer that automatically detects the language and does the right thing. But I've found this won't fit well with the design of Lucene and Solr. For one thing, there is no way to save the detected language in the field if the language is detected within the Analyzer. Lucene and Solr require that the language be known before an Analyzer can be instantiated, and it's the Analyzer that detects the language in my design. A second obstacle is that the kinds of Filters the Analyzer uses depend on the language, so they must be dynamically changed. This could be done programmatically but it's not easy. My big hope is that we can work together to come up with some way so that the detected language within the Analyzer can somehow be retrieved and made into the field. Anyway, if you are interested in trying my multi-lingual Analyzers, please contact me in private email. Regards, -kuro
Re: Multi-language indexing and searching
On 6/12/07, Teruhiko Kurosaka [EMAIL PROTECTED] wrote: For bi-lingual or tri-lingual search, we can have parallel fields (title_en, title_fr, title_de, for example) but this wouldn't scale well. Due to search across multiple fields, or due to increased index size? Lucene and Solr require that the language be known before an Analyzer can be instantiated, and it's the Analyzer that detects the language in my design. A second obstacle is that the kinds of Filters the Analyzer uses depend on the language, so they must be dynamically changed. This could be done programmatically but it's not easy. My big hope is that we can work together to come up with some way so that the detected language within the Analyzer can somehow be retrieved and made into the field. Something could be done for the indexing side of things, but then how do you query? Would you be able to do language detection on single-word queries, or do you apply multiple analyzers and query the same field multiple ways (which seems very close to the multiple-field approach)? Also, would multiple languages in a single field perhaps cause idf skew? 50 languages is a lot... perhaps a simple analyzer that could just try to break into words and lowercase? -Yonik
RE: storing the document URI in the index
On Tue, 2007-06-12 at 16:33 +0200, Ard Schrijvers wrote: Thanks Yonik and Walter ... Hi Ard, you may want to have a look at http://wiki.apache.org/solr/SolrForrest salu2 -- Thorsten Scherler thorsten.at.apache.org Open Source Java consulting, training and solutions
Re: indexing documents (or pieces of a document) by access controls
Hi all, Can anyone give me some advice on breaking a document up and indexing it by access control lists? ... But I guess I'm not really sure how to do this the best way. Anyone have any thoughts? Given the requirement to break down a document into separately controlled pieces, I'd create a servlet that fronts the Solr servlet and handles this conversion. I could think of ways to do it using Solr, but they feel like unnatural acts. As a general comment on ACLs, one relatively easy way to handle this is via group ids that you use to restrict the query. Each document has a groupid field with a list of group ids that are authorized to access it. Each user query is converted into (query) AND (groupid:xx OR groupid:yy), where xx/yy (and so on) are the groups that the user belongs to. -- Ken Krugler Krugle, Inc. +1 530-210-6378 If you can't find it, you can't fix it
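Ken's query rewrite can be sketched as a small helper. The function is hypothetical; the groupid field name comes from his example, and how you treat a user with no groups is an application decision:

```python
def restrict_by_groups(user_query, group_ids):
    """Wrap a user query so only documents whose groupid field
    contains one of the user's groups can match (Ken's scheme)."""
    if not group_ids:
        # No groups: match nothing. Placeholder policy; real
        # handling of groupless users is application-specific.
        return "-*:*"
    clause = " OR ".join("groupid:%s" % g for g in group_ids)
    return "(%s) AND (%s)" % (user_query, clause)
```

The rewritten string is then sent to Solr as the actual query, so the access restriction rides along with every search.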
Re: indexing documents (or pieces of a document) by access controls
Hi, And about the fields: whether or not they are going to be present in the responses based on the user's group, you can do it in many different ways (using an XML transformation to remove the undesirable fields, implementing your own RequestHandler able to process your group information, filtering the data and showing only what should be shown to the user, ...). Regards, Daniel
Re: LIUS/Fulltext indexing
Sounds interesting. I can't seem to find any clear dates on the project website. Do you know? ...V1 shipping date? Thanks!
Re: LIUS/Fulltext indexing
On 6/12/07, Vish D. [EMAIL PROTECTED] wrote: ...Sounds interesting. I can't seem to find any clear dates on the project website. Do you know? ...V1 shipping date?... Not at the moment, Tika just entered incubation and it's impossible to predict what will happen. But help is welcome, of course ;-) -Bertrand
RE: Multi-language indexing and searching
Hi Yonik, On 6/12/07, Teruhiko Kurosaka [EMAIL PROTECTED] wrote: For bi-lingual or tri-lingual search, we can have parallel fields (title_en, title_fr, title_de, for example) but this wouldn't scale well. Due to search across multiple fields, or due to increased index size? Due to the proliferation of the number of fields. Say we want the field title to hold the title of the book in its original language. But because Solr has this implicit assumption of one language per field, we would have to have the artificial fields title_fr, title_de, title_en, title_es, etc. for the number of supported languages, only one of which has a real value per document. This sounds silly, doesn't it? Something could be done for the indexing side of things, but then how do you query? Would you be able to do language detection on single word queries, or do you apply multiple analyzers and query the same field multiple ways (which seems very close to the multiple field approach)? You are right that language auto-detection does not work on queries. The search user would have to specify the language somehow. One commercial search engine vendor does this by prefixing a query term with $lang=en . I would do this with a drop-down list. Each user or session would have a default language that is configurable. Also, would multiple languages in a single field perhaps cause idf skew? Sorry, I don't know enough about the inside of search engines to discuss this. 50 languages is a lot... perhaps a simple analyzer that could just try to break into words and lowercase? This won't work because: (1) The concept of lowercase doesn't apply to all languages. (2) Even among languages that use Latin script, there can be different normalization rules. For many European languages, accent marks can be dropped (ü becomes u), but for German, ü may better be mapped to ue, which is the alternative spelling of ü in German writing. (3) Some languages such as Chinese and Japanese do not even use spaces or other delimiters to indicate word boundaries; language-specific rules have to be applied just to extract words from the run of text. -kuro
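Kuro's point (2), that folding rules differ per language, can be illustrated with a small sketch. This is illustrative only (real analyzers such as Basis Technology's handle far more cases), and the function name and language codes are made up for the example:

```python
import unicodedata

# German umlauts and eszett expand to their alternative spellings
# instead of simply losing their diacritics.
GERMAN_MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
              "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def normalize(text, lang):
    """Language-aware folding: German gets the ue/ae/oe expansion,
    other Latin-script languages just drop accent marks."""
    if lang == "de":
        for src, dst in GERMAN_MAP.items():
            text = text.replace(src, dst)
        return text.lower()
    # Decompose to base character + combining marks, then drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed
                       if unicodedata.category(c) != "Mn")
    return stripped.lower()
```

The same input character (ü) folds differently depending on the declared language, which is exactly why a single one-size-fits-all analyzer falls short.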
RE: question about sorting
Thanks, Yonik. Unfortunately we have users whose first names contain more than one word, so it seems copyField is my only option. Thanks Xuesong -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Tuesday, June 12, 2007 10:35 AM To: solr-user@lucene.apache.org Subject: Re: question about sorting On 6/11/07, Xuesong Luo [EMAIL PROTECTED] wrote: For example, first name, department, job title etc. Something like first name might be able to be a single field that is searchable and sortable (use a keyword tokenizer followed by a lowercase filter). If the field contains multiple words, and you want to both search and sort on that field, there isn't currently a better alternative to copyField. -Yonik
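Yonik's "keyword tokenizer followed by a lowercase filter" corresponds to a field type along these lines in schema.xml (a sketch; the type name string_lc is arbitrary, the factory class names are Solr's):

```xml
<!-- Whole value kept as one token, lowercased: searchable
     (exact, case-insensitive) and sortable from a single field. -->
<fieldType name="string_lc" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the tokenizer never splits, multi-word values stay as one term, which is why this works for sorting but not for matching on part of the name.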
Re: question about sorting
On 6/12/07, Xuesong Luo [EMAIL PROTECTED] wrote: Thanks, Yonik. Unfortunately we have users whose first names contain more than one word, it seems copy field is my only option. Yes, if you need to be able to match on part of a first name, rather than just exact first name. -Yonik
RE: Multi-language indexing and searching
: Due to the proliferation of fields. Say, we want : to have the field title to hold the title of the book in : its original language. But because Solr has this implicit : assumption of one language per field, we would have to have : the artificial fields title_fr, title_de, title_en, title_es, : etc. etc. for the number of supported languages, only one of : which has a real value per document. This sounds silly, doesn't it? not really, i have indexes with *thousands* of fields ... if you turn field norms off it's extremely efficient, but even with norms: 50*n fields where n is the number of real fields you have (title, author, etc..) should work fine. furthermore, declaration of these fields can be simple -- if you have a language you want to treat special, then presumably you have a special analyzer for it. dynamicFields where the field name is the wildcard and the language is set can be used to handle all of the different indexed fields...

<dynamicField name="*_english" type="english" />
<dynamicField name="*_french" type="french" />
<dynamicField name="*_spanish" type="spanish" />
...more like the above for each language you want to support...

<copyField source="*_english" dest="english" />
<copyField source="*_french" dest="french" />
<copyField source="*_spanish" dest="spanish" />
...more like the above for each language you want to support...

and now you can index documents with fields like this...

author_english = Mr. Chris Hostetter
author_spanish = Senor Cristobol Hostetter
body_english = I can't Believe It's not butter
body_spanish = No puedo creer que no es mantaquea
title_english = One Man's Disbelief

...and you can search on english:Chris, spanish:Cristobol, author_spanish:Cristobol, etc... you could even add dynamicFields with the field name set and the language wildcarded to handle any fields used solely for display with even less declaration (one per field instead of one per language) ...

<dynamicField name="display_title_*" type="string" />

... -Hoss
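An update message feeding the schema Hoss sketches might look like the following. This is only an illustration: the id field and its value are assumptions, and only the fields a document actually has in its language need to be sent.

```xml
<!-- Hypothetical Solr update message using language-suffixed field names. -->
<add>
  <doc>
    <field name="id">book-1</field>
    <field name="title_english">One Man's Disbelief</field>
    <field name="author_english">Mr. Chris Hostetter</field>
    <field name="author_spanish">Senor Cristobol Hostetter</field>
  </doc>
</add>
```

The dynamicField wildcards route each title_*/author_* field through the matching language analyzer, and the copyField rules additionally funnel everything into one per-language catch-all field for cross-field search.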
Re: To make sure XML is UTF-8
Hi Not sure if you've had a solution for your problem yet, but I had dealt with a similar issue that is mentioned below and hopefully it'll help you too. Of course, this assumes that your original data is in utf-8 format. The default charset encoding for mysql is Latin1 and our display format was utf-8, and that was the problem. These are the steps I performed to get the search data in utf-8 format. Changed my.cnf like so (though we can avoid this by executing commands on every new connection if we don't want the whole db in utf format):

Under [mysqld], added:

# setting default charset to utf-8
collation_server=utf8_unicode_ci
character_set_server=utf8
default-character-set=utf8

Under [client]:

default-character-set=utf8

After changing, restarted mysqld, re-created the db, re-inserted all the data again in the db using my data insert code (java program) and re-created the Solr index. The key is to change the settings for both the mysqld and client sections in my.cnf - the mysqld setting is to make sure that mysql doesn't convert it to latin1 while storing the data, and the client setting is to ensure that the data is not converted while accessing - going in or coming out from the server. Ajanta. Tiong Jeffrey wrote: Ya you are right! After I change it to UTF-8 the error still there... I looked at the log, this is what it appears, 127.0.0.1 - - [10/06/2007:03:52:06 +] POST /solr/update HTTP/1.1 500 4022 I tried to search but couldn't understand what error is this, anybody has any idea on this? Thanks!!! 
On 6/10/07, Chris Hostetter [EMAIL PROTECTED] wrote: : way during indexing is - FATAL: Connection error (is Solr running at : http://localhost/solr/update : ?): java.io.IOException: Server returned HTTP Response code: 500 for URL: : http://local/solr/update; : 4.Although the error code doesnt specify is XML utf-8 code error, but I did : a bit research, and look at the XML file that i have, it doesn't fulfill the : utf-8 encoding I *strongly* encourage you to look at the body of the response and/or the error log of your Servlet container and find out *exactly* what the cause of the error is ... you could spend a lot of time working on this and discover it's not your real problem. -Hoss
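Following Hoss's advice to pin down the real cause first, one cheap check is to verify that the file being POSTed actually decodes as UTF-8 before blaming Solr. A small sketch (the function name is made up; any equivalent check works):

```python
def check_utf8(path: str) -> bool:
    """Return True if the file decodes cleanly as UTF-8; otherwise
    report the byte offset of the first invalid sequence."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError as e:
        print(f"{path}: invalid UTF-8 at byte {e.start}: "
              f"{data[e.start:e.start + 4]!r}")
        return False
```

If this reports an invalid byte, the document needs transcoding before indexing; if the file is clean UTF-8, the 500 error almost certainly has a different cause, which is exactly what the servlet container's error log will show.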
Re: LIUS/Fulltext indexing
Wonder if TOM could be useful to integrate? http://tom.library.upenn.edu/convert/sofar.html On 6/12/07, Bertrand Delacretaz [EMAIL PROTECTED] wrote: On 6/12/07, Vish D. [EMAIL PROTECTED] wrote: ...Sounds interesting. I can't seem to find any clear dates on the project website. Do you know? ...V1 shipping date?... Not at the moment, Tika just entered incubation and it's impossible to predict what will happen. But help is welcome, of course ;-) -Bertrand
Re: To make sure XML is UTF-8
Hi Ajanta, thanks! Since I used PHP, I managed to use the PHP decode function to change it to UTF-8. But just a question: even if we change the mysql default char-set to UTF-8, and the input originally is in another format, the mysql engine won't help to convert it to UTF-8, right? I think my question is, what is the use of defining the char-set in mysql other than for labeling purposes? Thanks! Jeffrey On 6/13/07, Ajanta Phatak [EMAIL PROTECTED] wrote: ...The key is to change the settings for both the mysqld and client sections in my.cnf - the mysqld setting is to make sure that mysql doesn't convert it to latin1 while storing the data and the client setting is to ensure that the data is not converted while accessing - going in or coming out from the server...
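On Jeffrey's question: the charset label is more than decoration - it tells the server how to interpret the stored bytes and when to transcode between storage and connection charsets. When data labeled latin1 is really UTF-8 (or vice versa), a client-side fix like Jeffrey's PHP decode amounts to reinterpreting the bytes under the correct encoding. A sketch of that reinterpretation (illustrative only, not MySQL's internal behavior):

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Reinterpret bytes that were stored/served as Latin-1 and
    re-encode them as UTF-8 (roughly what a client-side decode
    fix like Jeffrey's does)."""
    return raw.decode("latin-1").encode("utf-8")

# 0xE9 is 'é' in Latin-1; as UTF-8 it becomes the two bytes 0xC3 0xA9.
assert latin1_to_utf8(b"caf\xe9") == b"caf\xc3\xa9"
```

With the charset declared correctly at both ends, as in Ajanta's my.cnf change, the server performs this kind of conversion itself and no client-side patching is needed.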
Compass vs Solr
Hi Everyone, We have a web application with search functionality built using Lucene. The search spans different types of data, so it does not scale well in the database. As Lucene does not let you store relational data, we decided to try out Compass, since it provides an object-relational mapping to the Lucene index. We have gotten good results with Compass when compared to the database search. But before we migrate all the other search workflows to use Compass, we are trying to evaluate Solr. We will need to scale our application as our data is increasing by the day. Can anyone suggest which one would perform/scale better, Compass or Solr? Or has anyone tried to use a combination of Compass and Solr? Any suggestion would be appreciated. Thanks, Harini