SolrJ: clusters, labels, docs - search results
Hello,

I was wondering how to access the cluster labels and docs (ids) via SolrJ. I have added the following:

    query.setParam("q", userQuery);
    query.setParam("clustering", true);
    query.setParam("qt", "/core2/clustering");
    query.setParam("carrot.title", "title");

But how do I access the labels and the docs in the clusters and display them in a search result?

Also, I've seen others specify clustering in this manner:

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("qt", "/core2/clustering");
    params.set("q", userQuery);
    params.set("carrot.title", "title");
    params.set("clustering", true);

Is this preferred over the other?

Thanks
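A minimal sketch of walking the clusters section of a response, under some loud assumptions. It assumes each cluster entry carries a "labels" list and a "docs" list of document ids, which is how the clustering component is usually described; check the raw response of your own handler. To keep the example runnable without SolrJ on the classpath, a plain Map stands in for each cluster entry. In SolrJ the section would presumably be read with something like response.getResponse().get("clusters"), and the entries there are NamedList objects rather than Maps. The class name ClusterPrinter is hypothetical. On the second question: SolrQuery is a subclass of ModifiableSolrParams, so the two ways of setting parameters should be equivalent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: Map is a stand-in for the per-cluster entries of the
// "clusters" section; SolrJ itself is not used here.
final class ClusterPrinter {

    /** Formats each cluster as "label1, label2 -> [id1, id2]". */
    static List<String> describe(List<Map<String, Object>> clusters) {
        List<String> lines = new ArrayList<>();
        for (Map<String, Object> cluster : clusters) {
            // "labels" and "docs" are the keys assumed for this sketch.
            List<?> labels = (List<?>) cluster.get("labels");
            List<?> docIds = (List<?>) cluster.get("docs");

            StringBuilder labelText = new StringBuilder();
            for (Object label : labels) {
                if (labelText.length() > 0) {
                    labelText.append(", ");
                }
                labelText.append(label);
            }
            lines.add(labelText + " -> " + docIds);
        }
        return lines;
    }
}
```

Each returned line could then be rendered next to the normal result list, with the document ids used to look up the matching documents in the main results.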
solr: adding a string on to a field via DIH
Hello,

Is it possible to concatenate a field via DIH? For example, for the id field, in order to make it unique I want to add 'project' to the beginning of the id field, so the field would look like 'project1234'. Is this possible?

    <field column="id" name="id" />

Thanks
Re: solr: adding a string on to a field via DIH
Thanks guys. I had taken a quick look at the TemplateTransformer and it looks like it does what I need it to do; I didn't see the 'hello' part when reviewing earlier.

On Sat, May 5, 2012 at 11:47 AM, Jack Krupansky j...@basetechnology.com wrote:

> Sounds like you need a TemplateTransformer: "... it helps to concatenate multiple values or add extra characters to field for injection."
>
>     <entity name="e" transformer="TemplateTransformer" ..>
>       <field column="namedesc" template="hello${e.name},${eparent.surname}" />
>       ...
>     </entity>
>
> See: http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer
>
> Or did you have something different in mind?
>
> -- Jack Krupansky
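To illustrate what such a template does for the id case, here is a small stand-alone sketch of placeholder substitution. It is not DIH's implementation: the class TemplateSketch, the placeholder pattern, and the choice to render unknown placeholders as empty strings are all assumptions made for the example. With an entity named e, the corresponding field definition would presumably use a template along the lines of project${e.id}.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in for template expansion; NOT the DIH TemplateTransformer.
final class TemplateSketch {

    // Matches ${name} and captures the name between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${key} in the template with the matching value. */
    static String apply(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            // Assumption for this sketch: unknown placeholders become "".
            m.appendReplacement(out, Matcher.quoteReplacement(value == null ? "" : value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling apply with the template project${e.id} and a row where e.id is 1234 yields project1234, which is the shape of id asked for in the original question.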
Re: how to present html content in browse
Hello,

I'm having a hard time understanding this, and I had this same question. When using DIH, should the HTML field be stored in the raw HTML string field or the stripped field? Also, what source field(s) need to be copied, and to what destination?

Thanks

On Thu, May 3, 2012 at 10:15 PM, Lance Norskog goks...@gmail.com wrote:

> Make two fields, one which stores the stripped HTML and another that stores the parsed HTML. You can use copyField so that you do not have to submit the html page twice. You would mark the stripped field 'indexed=true stored=false' and the full text field the other way around. The full text field should be a String type.
>
> On Thu, May 3, 2012 at 1:04 PM, srini softtec...@gmail.com wrote:
>
>> I am indexing records from database using DIH. The content of my record is in html format. When I use browse I would like to show the content in html format, not in text format. Any ideas?
>>
>> -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-present-html-content-in-browse-tp3960327.html Sent from the Solr - User mailing list archive at Nabble.com.
>
> -- Lance Norskog goks...@gmail.com
solr: how to change display name of a facet?
Hello,

Is there a way to change the display name (one that contains spaces or special characters) for a facet without changing the value of the facet field? For example, if my facet field name is 'category', I want to change the display name of the facet to 'Categories and Stuff'.

I've experimented with this:

    <str name="facet.field">{!ex=dt key=Categories and Stuff}category</str>

I'm not really sure what 'ex=dt' does, but it's obvious that 'key' is the desired display name? If there are spaces in the 'key' value, the display name gets cut off. What am I doing wrong?

Any help is greatly appreciated.
Re: solr: how to change display name of a facet?
Awesome, thanks!

On Thu, May 3, 2012 at 2:32 PM, Yonik Seeley yo...@lucidimagination.com wrote:

> On Thu, May 3, 2012 at 2:26 PM, okayndc bodymo...@gmail.com wrote:
>
>> [...] I've experimented with this:
>>
>>     <str name="facet.field">{!ex=dt key=Categories and Stuff}category</str>
>>
>> I'm not really sure what 'ex=dt' does, but it's obvious that 'key' is the desired display name? If there are spaces in the 'key' value, the display name gets cut off. What am I doing wrong?
>
> http://wiki.apache.org/solr/LocalParams
>
> For a non-simple parameter value, enclose it in single quotes.
>
> 'ex' excludes filters tagged with a value. See http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams
>
> -Yonik
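Following the advice above, the value with spaces just needs single quotes inside the local params. A small sketch of building that parameter string is below; the class FacetLocalParams and its method are hypothetical helpers, not part of Solr or SolrJ, and keys that themselves contain a single quote are rejected rather than escaped, to keep the sketch simple. The resulting string is what would be sent as the facet.field value (in SolrJ, presumably via SolrQuery.addFacetField).

```java
// Hypothetical helper for composing a facet.field value with local params.
final class FacetLocalParams {

    /**
     * Builds e.g. {!ex=dt key='Categories and Stuff'}category.
     * excludeTag may be null when no tagged filter should be excluded.
     */
    static String facetField(String field, String key, String excludeTag) {
        if (key.indexOf('\'') >= 0) {
            // Escaping rules are out of scope for this sketch.
            throw new IllegalArgumentException("key must not contain a single quote");
        }
        StringBuilder sb = new StringBuilder("{!");
        if (excludeTag != null) {
            sb.append("ex=").append(excludeTag).append(' ');
        }
        // Single quotes keep a key with spaces from being cut off.
        sb.append("key='").append(key).append("'}").append(field);
        return sb.toString();
    }
}
```

Note that ex=dt only has an effect if some filter query is tagged with dt; when there is no such filter, passing null for the tag leaves it out.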
Re: extracting/indexing HTML via cURL
Thank you Jack. So, is it not possible to search and highlight keywords within a field that contains the raw formatted HTML, and to strip out the HTML tags during analysis, so that a user would get back nothing if they searched for a tag name (e.g. p)?

On Mon, Apr 30, 2012 at 5:17 PM, Jack Krupansky j...@basetechnology.com wrote:

> I was thinking that you wanted to index the actual text from the HTML page, but have the stored field value still have the raw HTML with tags. If you just want to store only the raw HTML, a simple string field is sufficient, but then you can't easily do a text search on it.
>
> Or, you can have two fields, one string field for the raw HTML (stored, but not indexed) and then do a CopyField to a text field that has the HTMLStripCharFilter to strip the HTML tags and index only the text (indexed, but not stored).
>
> -- Jack Krupansky
Re: extracting/indexing HTML via cURL
Awesome, I'll give it a try. Thanks Jack!

On Tue, May 1, 2012 at 10:23 AM, Jack Krupansky j...@basetechnology.com wrote:

> Sorry for the confusion. It is doable. If you feed the raw HTML into a field that has the HTMLStripCharFilter, the stored value will retain the HTML tags, while the indexed text will be stripped of the tags during analysis and be searchable just like a normal text field. Then, search will not see p.
>
> -- Jack Krupansky
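The stored-versus-indexed distinction described above can be sketched in a few lines. This is only an illustration: the regex below is a crude stand-in for tag stripping and is not Solr's HTMLStripCharFilter, the whitespace split is not a real tokenizer, and the class name StoredVsIndexed is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration of "stored keeps the HTML, indexed terms do not".
final class StoredVsIndexed {

    // Crude tag removal for the sketch; NOT Solr's HTMLStripCharFilter.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    /** Terms that would be searchable: tags removed, lowercased, split on whitespace. */
    static List<String> indexedTerms(String rawHtml) {
        List<String> terms = new ArrayList<>();
        for (String term : stripTags(rawHtml).toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** The stored value is the input exactly as submitted, tags included. */
    static String storedValue(String rawHtml) {
        return rawHtml;
    }
}
```

For the input <p>Bird house</p>, the indexed terms are bird and house, so a search for p finds nothing, while the stored value returned for display still contains the paragraph tags.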
Solr: extracting/indexing HTML via cURL
Hello,

Over the weekend I experimented with extracting HTML content via cURL, and I am wondering why the extraction/indexing process does not include the HTML tags. It seems as though the HTML tags are either being ignored or stripped somewhere in the pipeline. If this is the case, is it possible to include the HTML tags, as I would like to keep the formatted HTML intact?

Any help is greatly appreciated.
Re: Solr: extracting/indexing HTML via cURL
Great, thank you for the input. My understanding of HTMLStripCharFilter is that it strips HTML tags, which is not what I want. Is this correct? I want to keep the HTML tags intact.

On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky j...@basetechnology.com wrote:

> If by extracting HTML content via cURL you mean using SolrCell to parse html files, this seems to make sense. The sequence is that regardless of the file type, each file extraction parser will strip off all formatting and produce a raw text stream. Office, PDF, and HTML files are all treated the same in that way. Then, the unformatted text stream is sent through the field type analyzers to be tokenized into terms that Lucene can index. The input string to the field type analyzer is what gets stored for the field, but this occurs after the extraction file parser has already removed formatting. There is no way for the formatting to be preserved in that case, other than to go back to the original input document before extraction parsing.
>
> If you really do want to preserve full HTML formatted text, you would need to define a field whose field type uses the HTMLStripCharFilter and then directly add documents that direct the raw HTML to that field.
>
> There may be some other way to hook into the update processing chain, but that may be too much effort compared to the HTML strip filter.
>
> -- Jack Krupansky
escaping HTML tags within XML file
Hello,

I was wondering if it is necessary to escape HTML tags within an XML file for indexing. If so, it seems like a large XML file with tons of HTML tags could get really messy (using CDATA). Has this been your experience? Do you escape the HTML tags? If so, what technique do you use? Or do you leave the HTML tags in place without escaping them?

Thanks!
Re: escaping HTML tags within XML file
Here is a representation of the XML file:

    <root>
      <commenter>
        <comment><p>Text here</p><img src="image.gif" /><p>More text here</p></comment>
      </commenter>
    </root>

I want to keep the HTML tags because they keep the formatting (paragraph tags, etc.) intact for the output. It seems like you're saying that the HTML can be kept intact with the use of an HTML field type, without having to escape the HTML tags?

On Sun, Sep 25, 2011 at 2:52 PM, pulkitsing...@gmail.com wrote:

> Assuming that the XML has the HTML as values inside fully formed tags like so: <node><HTML></HTML></node>, then I think that using the HTML field type in schema.xml for indexing/storing will allow you to do meaningful searches on the content of the HTML without getting confused by the HTML syntax itself.
>
> If you have absolutely no need for the entire stored HTML when presenting results to the user, then stripping out the syntax at index time makes sense. This will adversely affect highlighting of that document field as well, so just know your requirements.
>
> If you don't want to present anything at all, then don't store; just index and use the right field type (HTML) such that search results find the right document. Just because a field is helpful in finding the doc doesn't mean folks always want to present it or store it.
>
> With Data Import Handler, an HTML stripping transformer is present so that the markup is removed before the indexer gets its hands on things. I can't be sure if that is how you get your data into Solr.
>
> - Pulkit
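On the escaping question itself: when HTML is carried as the text of an XML element, the markup characters have to be escaped (or wrapped in CDATA) for the XML to stay well-formed, and the XML parser hands the original HTML back on the other side. A minimal sketch of the escaping route follows; the class XmlEscape is a hypothetical helper, and the field name is assumed to be a plain identifier that needs no escaping of its own.

```java
// Hypothetical helper: escapes HTML so it can sit inside an XML element as text.
final class XmlEscape {

    /** Escapes the three characters that matter in XML element content. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Wraps escaped HTML in a field element of the kind used in Solr XML update messages. */
    static String field(String name, String html) {
        // Assumption: name is a simple identifier, so it is inserted as-is.
        return "<field name=\"" + name + "\">" + escape(html) + "</field>";
    }
}
```

Generating the escaped form programmatically like this avoids hand-editing large files, which is where the messiness mentioned in the question tends to come from.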
running SOLR on same server as your website
Hi everyone!

Is it bad practice to run Solr on the same server where your website files sit? Or is it a MUST to house Solr on its own application server? The situation I'm facing is that my website's files sit in a servlet container (Tomcat), and I think it would be more convenient to house the Solr instance on the same server. Is this not a good idea? What is your Solr setup?

Thanks
Re: running SOLR on same server as your website
In the context of 'application', I assume that you mean SolrJ (for example)?

On Wed, Sep 7, 2011 at 10:04 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

> It's not necessarily a bad idea... as long as you secure it properly such that user requests cannot hit Solr, only requests from your application can do so. Eventually, perhaps, scale would be an issue and you'd want/need to separate the tiers, but as long as you've got security and scalability covered there's no reason not to deploy together like that.
>
> Erik
Re: running SOLR on same server as your website
Right now, the index is relatively small, less than 1 MB. I think it's okay for now, but a couple of years down the road we may have to move Solr onto a separate application server.

On Wed, Sep 7, 2011 at 10:15 AM, Jaeger, Jay - DOT jay.jae...@dot.wi.gov wrote:

> You could host Solr inside the same Tomcat container, or in a different servlet container (say, a second Tomcat instance) on the same server.
>
> Be aware of your OS memory requirements, though: in my experience, Solr performs best when it has lots of OS memory to cache index files (at least, if your index is very big). For that reason alone, we chose to host our Solr instance (used internally only) in a separate virtual machine in its own web app server instance.
>
> It is all a matter of managing your memory, CPU and disk performance. If those are already constrained or nearly constrained on your website, then adding Solr into that mix is probably not such a good idea. If those are not issues on your existing website, and your Solr load is modest, then you can probably squeeze it onto the same server.
>
> Like most real-world answers, it comes down to "it depends".
>
> JRJ
Re: solr/velocity: function for sorting asc/desc
Thanks Erik. So if I had a link 'Sort Title' and the default is sort=title desc, how can I switch that to sort=title asc?

Example: the 'Sort Title' link (default sort=title desc). The user clicks on the link and the sort should toggle (or switch) to sort=title asc. How can this be achieved?

-- View this message in context: http://lucene.472066.n3.nabble.com/solr-velocity-funtion-for-sorting-asc-desc-tp3163549p3167267.html
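The toggle being asked about boils down to flipping the direction part of the current sort parameter when building the link. A minimal sketch of that logic is below; the class SortToggle is hypothetical, it only handles a single sort clause, and treating a missing direction as ascending is an assumption of the sketch. The same logic could live in a Velocity macro or in front-end script instead.

```java
// Hypothetical helper: flips the direction of a single-clause sort value.
final class SortToggle {

    /** "title desc" becomes "title asc"; anything else becomes "<field> desc". */
    static String toggle(String sort) {
        String[] parts = sort.trim().split("\\s+");
        String field = parts[0];
        // Assumption: a missing or unrecognized direction counts as ascending.
        boolean descending = parts.length > 1 && parts[1].equalsIgnoreCase("desc");
        return field + (descending ? " asc" : " desc");
    }
}
```

The link for 'Sort Title' would then carry the toggled value of whatever sort the current request used, so each click alternates between the two orders.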
Re: solr/velocity: function for sorting asc/desc
Awesome, thanks Erik!

-- View this message in context: http://lucene.472066.n3.nabble.com/solr-velocity-funtion-for-sorting-asc-desc-tp3163549p3167662.html
solr/velocity: function for sorting asc/desc
Hello,

I was wondering if there is a Solr/Velocity function out there that can sort, say, a title name, by clicking on a link named 'sort title', ascending or descending alphabetically. Or is this a frontend/jQuery type of thing?

Thanks

-- View this message in context: http://lucene.472066.n3.nabble.com/solr-velocity-funtion-for-sorting-asc-desc-tp3163549p3163549.html
Re: Exact phrase highlighting
Has this bug been fixed? I'm using Solr 3.1 and it still seems to be an issue. If I do a search for "bird house" I still get <em>bird</em> <em>house</em> returned instead of <em>bird house</em>, which is the desired result.

-- View this message in context: http://lucene.472066.n3.nabble.com/Exact-phrase-highlighting-tp480339p3113824.html
Re: velocity: hyperlinking to documents
Yes, from the handy /browse view. I'll give this a try. Thanks Erik!

-- View this message in context: http://lucene.472066.n3.nabble.com/velocity-hyperlinking-to-documents-tp3091504p3100957.html
velocity: hyperlinking to documents
Hello,

I'm not sure of the correct Velocity syntax to link, let's say, a title field to the actual document itself. I have hostname, category (which is also the directory where the file sits) and filename fields in my schema. Can I potentially use these fields to get at the document itself?

-- View this message in context: http://lucene.472066.n3.nabble.com/velocity-hyperlinking-to-documents-tp3091504p3091504.html
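Whatever the template syntax ends up being, the link itself is just the three fields joined into a URL. A minimal sketch of that composition is below, under stated assumptions: the http scheme and the /category/filename layout are guesses based on the description above, and the class DocumentLink is hypothetical. The multi-argument java.net.URI constructor is used so that characters such as spaces in a filename get percent-encoded.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: composes a document URL from three stored field values.
final class DocumentLink {

    /** Builds http://<hostname>/<category>/<filename>, encoding illegal path characters. */
    static String build(String hostname, String category, String filename) {
        try {
            // Assumption: documents are served over http under /<category>/<filename>.
            URI uri = new URI("http", hostname, "/" + category + "/" + filename, null);
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot build link from field values", e);
        }
    }
}
```

In a template, the resulting string would be used as the href of the anchor wrapped around the title value.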