Re: Regex replacement not working!
I have had the same problems with regex and I went with the regular pattern replace filter rather than the charfilter. When I added it to the very end of the chain, only then would it work... I am on Solr 3.2. I have also noticed that the HTML filter factory is not working either. When I dump the field that it's supposed to be working on, all the hyperlinks and everything that you would expect to be stripped are still present. Adam On Wed, Jun 29, 2011 at 10:04 AM, samuele.mattiuzzo samum...@gmail.com wrote: ok, last question on the UpdateProcessor: can you please give me the steps to implement my own? i mean, i can push my custom processor in solr's code, and then what? i don't understand how i have to change the solrconfig.xml and how can i bind that to the updater i just wrote, and also i don't understand how i have to change the schema.xml i'm sorry for this question, but i started working on solr 5 days ago and for some things i really need a lot of documentation, and this isn't fully covered anywhere -- View this message in context: http://lucene.472066.n3.nabble.com/Regex-replacement-not-working-tp3120748p3121743.html Sent from the Solr - User mailing list archive at Nabble.com.
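The registration step Samuele asks about happens in solrconfig.xml: define an update processor chain containing the custom factory, then point the update handler at it by name. Schema.xml only needs changes if the processor adds new fields. A hedged sketch (the factory class name is a placeholder; the compiled jar goes in a lib directory that solrconfig.xml references):

```xml
<!-- solrconfig.xml: wire a custom processor into a named chain.
     com.example.MyUpdateProcessorFactory is a placeholder. -->
<updateRequestProcessorChain name="mychain">
  <processor class="com.example.MyUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- point the update handler at the chain -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">mychain</str>
  </lst>
</requestHandler>
```

RunUpdateProcessorFactory must stay at the end of the chain or documents never reach the index; note that on some older 3.x releases the request parameter is update.processor rather than update.chain, so check the docs for your version.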
REGEX Proper Usage?
All, I am having trouble getting my regex pattern to work properly. I have tried PatternReplaceFilterFactory after the standard tokenizer:

<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z0-9])" replacement="" replace="all"/>

and PatternReplaceCharFilterFactory before it:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z0-9])" replacement="" replace="all"/>

It looks like this should work to remove everything except letters and numbers.

<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
<filter class="solr.LengthFilterFactory" min="2" max="999"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z0-9])" replacement="" replace="all"/>

I am left with quite a few facet items like this:

<int name="_ view">1443</int>
<int name="view _">1599</int>

Can anyone suggest what may be going on here? I have verified that my regex works properly here: http://www.fileformat.info/tool/regex.htm Adam
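The pattern itself can be sanity-checked outside Solr. A quick sketch in Python (the sample strings are the stray facet values from the post):

```python
import re

# Mirrors pattern="([^a-z0-9])" replacement="" replace="all":
# strip everything that is not a lowercase letter or a digit.
def strip_non_alnum(token):
    return re.sub(r"[^a-z0-9]", "", token)

print(strip_non_alnum("_ view"))  # prints "view" - underscore and space removed
print(strip_non_alnum("view _"))  # prints "view"
```

Since the pattern does remove underscores and whitespace, leftover facet values like `_ view` suggest the filter isn't being applied to that field at all (compare the reply elsewhere in this thread about it only working at the very end of the chain), rather than the regex being wrong.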
Re: Mahout Solr
You're right... It would be nice to be able to see the cluster results coming from Solr though... Adam On Thu, Jun 16, 2011 at 3:21 AM, Andrew Clegg andrew.clegg+mah...@gmail.com wrote: Well, it does have the ability to pull TermVectors from an index: https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html#CreatingVectorsfromText-FromLucene Nothing Solr-specific about it though. On 15 June 2011 15:38, Mark static.void@gmail.com wrote: Apache Mahout is a new Apache TLP project to create scalable machine learning algorithms under the Apache license. It is related to other Apache Lucene projects and integrates well with Solr. How does Mahout integrate well with Solr? Can someone give a brief overview of what's available? I'm guessing one of the features would be replacing the Carrot2 clustering algorithm with something a little more sophisticated? Thanks -- http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
Re: Mahout Solr
The only integration at this point (as far as I can tell) is that Mahout can read the Lucene index created by Solr. I agree that it would be nice to swap out the Carrot2 clustering engine with Mahout's set of algorithms, but that has not been done yet. Grant has pointed out that you can use Solr's callback system to fire off another task like a Mahout job. http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/ Adam On Wed, Jun 15, 2011 at 10:38 AM, Mark static.void@gmail.com wrote: Apache Mahout is a new Apache TLP project to create scalable machine learning algorithms under the Apache license. It is related to other Apache Lucene projects and integrates well with Solr. How does Mahout integrate well with Solr? Can someone give a brief overview of what's available? I'm guessing one of the features would be replacing the Carrot2 clustering algorithm with something a little more sophisticated? Thanks
[Handling] empty fields
All, I have a field foo with several thousand blank or non-existing records in it. This is also my faceting field. My question is, how can I deal with this field so that I don't get a blank facet at query time? <int name="">5000</int> vs. <int name="Flickr">1000</int> Adam
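Two ways this blank bucket is usually handled, sketched here as query parameters (field name foo is from the post; which applies depends on whether the value is truly missing or an indexed empty string):

```
# only facet over documents that actually have a value in foo
fq=foo:[* TO *]

# or suppress missing/low-count buckets in the facet response
facet=true&facet.field=foo&facet.mincount=1&facet.missing=false
```

If genuinely empty strings are being indexed, the cleanest fix is to drop them at index time (e.g., don't emit the field for blank values), since facet.mincount won't hide a bucket whose count is 5000.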
Re: Finding Keywords/Phrases
Hi Frank, I have been working on something very similar and I am at the point where I don't believe (and I could be totally wrong) that a pure Solr solution is going to do this. I would look at Mahout and play with some of the machine learning algorithms that it can run against a Lucene index. I have not gotten any further than experimenting with it right now, but so far it looks promising. Adam On Sun, Jun 12, 2011 at 10:20 AM, Frank A fsa...@gmail.com wrote: I have a single copyField that has a number of other fields copied to it. I'm trying to extract a list of keywords and common terms. I realize it may not be 100% dynamic and I may need to manually filter. Right now I tried using a CommonGrams filter. However, what I see is that it creates tokens for both the individual words and the combined "hot dog". Is there any way from within Solr configuration to count "hot" only when it is not followed by "dog"? For example, right now I may see a term/frequency of: hot 8, dog 6, hot dog 6. What I really want is: hot dog 6, hot 2. Any ideas?
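Nothing in a stock Solr analysis chain does this subtraction, but the frequencies Frank wants can be derived from the ones Solr already reports. A sketch of the arithmetic (counts taken from the example above):

```python
# Discount each unigram by the number of times it appears inside
# the bigram, so "hot" only counts when not followed by "dog".
def discount_unigrams(freqs, bigram):
    first, second = bigram.split()
    adjusted = dict(freqs)
    adjusted[first] = freqs[first] - freqs[bigram]
    adjusted[second] = freqs[second] - freqs[bigram]
    return adjusted

freqs = {"hot": 8, "dog": 6, "hot dog": 6}
print(discount_unigrams(freqs, "hot dog"))  # {'hot': 2, 'dog': 0, 'hot dog': 6}
```

This only approximates what true position-aware counting would give; doing it inside Solr would take a custom filter or post-processing of the terms response.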
[Mahout] Integration with Solr
Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the core build but the docs say that it's not very good for very large indexes. Anyone have thoughts on this? Thanks, Adam
[Free Text] Field Tokenizing
All, I am at a bit of a loss here so any help would be greatly appreciated. I am using the DIH to grab data from a DB. The field that I am most interested in has anywhere from 1 word to several paragraphs worth of free text. What I would really like to do is pull out phrases like "Joe's coffee shop" rather than the 3 individual words. I have tried the KeywordTokenizerFactory and it mostly does what I want, but it is not actually tokenizing anything, so it's not creating the tokens that I need for further analysis in apps like Mahout. We can play with the combination of tokenizers and filters all day long and see what the results are after a quick reindex. I typically just view them in Solritas as facets, which may be the problem for me too. Does anyone have an example fieldType they can share with me that shows how to extract phrases, if they are there, from the data I described earlier? Am I even going about this the right way? I am using today's trunk build of Solr and here is what I have munged together this morning.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Thanks, Adam
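One thing worth noting about the chain above: KeywordTokenizerFactory emits the entire field value as a single token, so ShingleFilterFactory has nothing to combine. Shingles only appear when an upstream tokenizer splits the text first. A rough Python sketch of what whitespace tokenization plus shingling (maxShingleSize=4) would produce — a sketch of the token stream only, not Lucene's actual ShingleFilter, so positions, stopwords, and filler tokens are ignored:

```python
def shingles(text, max_size=4):
    """Whitespace-tokenize, then emit every n-gram of adjacent tokens
    for n = 2..max_size, plus the unigrams (like outputUnigrams=true)."""
    tokens = text.split()
    out = list(tokens)  # unigrams
    for n in range(2, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(shingles("Joe's coffee shop"))
# ["Joe's", 'coffee', 'shop', "Joe's coffee", 'coffee shop', "Joe's coffee shop"]
```

So swapping KeywordTokenizerFactory for WhitespaceTokenizerFactory (or StandardTokenizerFactory) would let the ShingleFilter generate the phrase tokens the post is after.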
Re: [Mahout] Integration with Solr
Thanks for the reply, Tommaso! I would like to see tighter integration, like the way Nutch integrates with Solr. There is a single param that you set which points to the Solr instance. My interest in Mahout is its ability to handle large data and find frequency, collocation of data, clustering, etc. All the algorithms that are in the core build are great and I am just now wrapping my head around how to use them all. Adam On Thu, Jun 9, 2011 at 10:33 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: Hello Adam, I've managed to create a small POC of integrating Mahout with Solr for a clustering task, do you want to use it for clustering only or possibly for other purposes/algorithms? More generally speaking, I think it'd be nice if Solr could be extended with a proper API for integrating clustering engines in it so that one can plug and exchange engines flawlessly (just need an Adapter). Regards, Tommaso 2011/6/9 Adam Estrada estrada.adam.gro...@gmail.com Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the core build but the docs say that it's not very good for very large indexes. Anyone have thoughts on this? Thanks, Adam
Re: [Free Text] Field Tokenizing
Erick, I totally understand that BUT the keyword tokenizer factory does a really good job extracting phrases (or what look like phrases) from my data. I don't know why exactly but it does do it. I am going to continue working through it to see if I can't figure it out ;-) Adam On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson erickerick...@gmail.com wrote: The problem here is that none of the built-in filters or tokenizers have a prayer of recognizing what #you# think are phrases, since it'll be unique to your situation. If you have a list of phrases you care about, you could substitute a single token for the phrases you care about... But the overriding question is what determines a phrase you're interested in? Is it a list or is there some heuristic you want to apply? Or could you just recognize them at query time and make them into a literal phrase (i.e. with quotation marks)? Best Erick On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada estrada.adam.gro...@gmail.com wrote: All, I am at a bit of a loss here so any help would be greatly appreciated. I am using the DIH to grab data from a DB. The field that I am most interested in has anywhere from 1 word to several paragraphs worth of free text. What I would really like to do is pull out phrases like "Joe's coffee shop" rather than the 3 individual words. I have tried the KeywordTokenizerFactory and it mostly does what I want, but it is not actually tokenizing anything, so it's not creating the tokens that I need for further analysis in apps like Mahout. We can play with the combination of tokenizers and filters all day long and see what the results are after a quick reindex. I typically just view them in Solritas as facets, which may be the problem for me too. Does anyone have an example fieldType they can share with me that shows how to extract phrases, if they are there, from the data I described earlier? Am I even going about this the right way?
I am using today's trunk build of Solr and here is what I have munged together this morning.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Thanks, Adam
[Visualizations] from Query Results
Dear Solr experts, I am curious to learn what visualization tools are out there to help me visualize my query results. I am not talking about a language-specific client per se but something more like Carrot2, which breaks clusters into their knowledge tree and expandable pie chart. Sorry if those aren't the correct names for those tools ;-) Anyway, what else is out there like Carrot2 http://project.carrot2.org/ to help me visualize Solr query results? Thanks for your input, Adam
Re: [Visualizations] from Query Results
Otis and Erick, Believe it or not, I did Google this and didn't come up with anything all that useful. I was at the Lucene Revolution conference last year and saw some prezos that had some sort of graphical representation of the query results. The one from Basis Tech especially caught my attention because it simply showed a graph of hits over time. I can do that using jQuery or Raphael as he suggested. I have also been playing with the Carrot2 visualization tools, which are pretty cool too, which is why I pointed them out in my original email. I was just curious to see if there were any specialty-type projects out there like Carrot2 that folks in the Solr community are using. Adam On Fri, Jun 3, 2011 at 9:42 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Adam, Try this: http://lmgtfy.com/?q=search%20results%20visualizations In practice I find that visualizations are cool and attractive looking, but often text is more useful because it's more direct. But there is room for graphical representation of search results, sure. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Adam Estrada estrada.adam.gro...@gmail.com To: solr-user@lucene.apache.org Sent: Fri, June 3, 2011 7:13:39 AM Subject: [Visualizations] from Query Results Dear Solr experts, I am curious to learn what visualization tools are out there to help me visualize my query results. I am not talking about a language-specific client per se but something more like Carrot2, which breaks clusters into their knowledge tree and expandable pie chart. Sorry if those aren't the correct names for those tools ;-) Anyway, what else is out there like Carrot2 http://project.carrot2.org/ to help me visualize Solr query results? Thanks for your input, Adam
GeoJSON Response Writer
All, Has anyone modified the current JSON response writer to support the GeoJSON geospatial encoding standard? See here: http://geojson.org/ Just curious... Adam
Re: Solr: Images, Docs and Binary data
Well... by default there is a pretty decent schema that you can use as a template in the example project that builds with Solr. Tika is the library that does the actual content extraction, so it would be a good idea to try the example project out first. Adam 2011/4/6 Ezequiel Calderara ezech...@gmail.com Another question that maybe is easier to answer: how can I store binary data? Any example schema? 2011/4/6 Ezequiel Calderara ezech...@gmail.com Hello everyone, I need to know if someone has used Solr for indexing and storing images (up to 16MB) or binary docs. How does Solr behave with this type of docs? How does it affect performance? Thanks Everyone -- __ Ezequiel. Http://www.ironicnet.com -- __ Ezequiel. Http://www.ironicnet.com
Re: dataimport
Brian, I had the same problem a while back and set the JAVA_OPTS env variable to something my machine could handle. That may also be an option for you going forward. Adam On Wed, Mar 9, 2011 at 9:33 AM, Brian Lamb brian.l...@journalexperts.com wrote: This has since been fixed. The problem was that there was not enough memory on the machine. It works just fine now. On Tue, Mar 8, 2011 at 6:22 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : INFO: Creating a connection for entity id with URL: : jdbc:mysql://localhost/researchsquare_beta_library?characterEncoding=UTF8&zeroDateTimeBehavior=convertToNull : Feb 24, 2011 8:58:25 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 : call : INFO: Time taken for getConnection(): 137 : Killed : : So it looks like for whatever reason, the server crashes trying to do a full : import. When I add a LIMIT clause on the query, it works fine when the LIMIT : is only 250 records but if I try to do 500 records, I get the same message. ...wow. that's ... weird. I've never seen a java process just log Killed like that. The only time i've ever seen a process log Killed is if it was terminated by the os (ie: kill -9 pid) What OS are you using? how are you running solr? (ie: are you using the simple jetty example java -jar start.jar or are you using a different servlet container?) ... are you absolutely certain your machine doesn't have some sort of monitoring in place that kills jobs if they take too long, or use too much CPU? -Hoss
Re: Tomcat EXE Source Code
Some of these links may help... http://www.google.com/search?client=safari&rls=en&q=apache+tomcat+download&ie=UTF-8&oe=UTF-8 Adam On Feb 25, 2011, at 3:16 AM, rajini maski wrote: Can anybody help me to get the source code of the Tomcat exe file, i.e., the source code of the installation exe? Thanks..
Re: Datetime problems with dataimport
I logged an issue in Jira that relates to this and it looks like Yonik picked it up. https://issues.apache.org/jira/browse/SOLR-2286 Adam On Feb 22, 2011, at 9:07 AM, MOuli wrote: Ok i got it. It should look like yyyy-MM-ddTHH:mm:ssZ, for example: 2011-02-22T15:07:00Z -- View this message in context: http://lucene.472066.n3.nabble.com/Datetime-problems-with-dataimport-tp2545654p2552477.html Sent from the Solr - User mailing list archive at Nabble.com.
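For anyone generating these values outside the DIH, Solr's canonical date form is easy to produce. A Python sketch reproducing the example above:

```python
from datetime import datetime

def to_solr_date(dt):
    # Solr's canonical form: yyyy-MM-dd'T'HH:mm:ss'Z'. The datetime
    # must already be UTC; strftime does not convert time zones.
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(to_solr_date(datetime(2011, 2, 22, 15, 7, 0)))  # prints 2011-02-22T15:07:00Z
```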
[Solr] and CouchDB
I am curious to see if anyone has messed around with Solr and the Couch-Lucene incarnation that is out there...I was passed this article this morning and it really opened my eyes about CouchDB http://m.readwriteweb.com/hack/2011/02/hacker-chat-max-ogden.php Thoughts, Adam
Re: Indexing AutoCAD files
Hi Vignesh, I believe that you would have to incorporate GDAL into Tika in order to read the file and extract the proper metadata. This is entirely doable but I don't know how to do it. There are companies out there that specialize in this sort of thing so hopefully one of them has already contacted you outside of this list, but I would love to see some detailed instruction on how to integrate GDAL into Tika. Best of luck, Adam On Sat, Feb 19, 2011 at 12:31 AM, Vignesh Raj vignesh...@greatminds.co.in wrote: Hi team, Is there a way lucene can index AutoCAD files - *.dwg files? If so, please let me know. Can you please provide some insight on the same? Thanks in advance.. Regards Vignesh
Re: Index Autocad
I think you may have already posted this same question but please check VoyagerGIS out. They have some shit-hot software that is geared specifically towards the archive and retrieval of geospatial data. I suggest that you check it out!!! w/r, Adam On Sat, Feb 19, 2011 at 2:33 AM, lucene lucene luc...@greatminds.co.in wrote: Hi team, Is there a way lucene can index AutoCAD files – “*.dwg” files? If so, please let me know. Can you please provide some insight on the same? Thanks in advance.. Regards Vignesh
Re: Difference between Solr and Lucidworks distribution
I believe that the Lucid Works distro for Solr is free and, as you mentioned, they only appear to sell their services for it. I have used that version for several demos because it does seem to have all the bells and whistles already included and it's super easy to set up. The only downside in my case is that they are still on the official release version 1.4.1, which has an older version of PDFBox that doesn't parse PDFs generated from newer Adobe software. Thanks Adobe ;-) It's easy enough to just rebuild Tika, PDFBox, FontBox, etc. and swap them out... If you want spatial support, you can use the plugin from the Spatial Solr project out of the Netherlands which is designed to support 1.4.1 and from what I can tell seems to work pretty well. Anyway, when 4.0 is released, hopefully with the extended spatial support from projects like SIS and JTS, I hope to see the official distro version from Lucid change. Thanks for all the hard work the Lucid Team has provided over the years! Adam On Feb 12, 2011, at 10:55 PM, Andy wrote: Now I'm confused. In http://www.lucidimagination.com/lwe/subscriptions-and-pricing, the price of LucidWorks Enterprise Software is stated as FREE. I thought the price for Production was for the support service, not for the software. But you seem to be saying that 'LucidWorks Enterprise' is a separate software that isn't free. Did I misunderstand? --- On Sat, 2/12/11, Lance Norskog goks...@gmail.com wrote: From: Lance Norskog goks...@gmail.com Subject: Re: Difference between Solr and Lucidworks distribution To: solr-user@lucene.apache.org, markus.jel...@openindex.io Date: Saturday, February 12, 2011, 8:10 PM There are two distributions. The company is Lucid Imagination. 'Lucidworks for Solr' is the certified distribution of Solr 1.4.1, with several enhancements. Markus refers to 'LucidWorks Enterprise', which is LWE. This is a separate app with tools and a REST API for managing a Solr instance.
Lance Norskog On Fri, Feb 11, 2011 at 8:36 AM, Markus Jelsma markus.jel...@openindex.io wrote: It is not free for production environments. http://www.lucidimagination.com/lwe/subscriptions-and-pricing On Friday 11 February 2011 17:31:22 Greg Georges wrote: Hello all, I just started watching the webinars from Lucidworks, and they mention their distribution which has an installer, etc.. Is there any other differences? Is it a good idea to use this free distribution? Greg -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350 -- Lance Norskog goks...@gmail.com
Re: [WKT] Spatial Searching
Grant, How could I stub this out not being a java guy? What is needed in order to do this? Licensing is always going to be an issue with JTS which is why I am interested in the project SIS sitting in incubation right now. I'm willing to put forth the effort if I had a little direction on how to implement it from the peanut gallery ;-) Adam On Feb 9, 2011, at 7:03 AM, Grant Ingersoll wrote: The show stopper for JTS is its license, unfortunately. Otherwise, I think it would be done already! We could, since it's LGPL, make it an optional dependency, assuming someone can stub it out. On Feb 8, 2011, at 11:18 PM, Adam Estrada wrote: I just came across a ~nudge post over in the SIS list on what the status is for that project. This got me looking more into spatial mods with Solr 4.0. I found this enhancement in Jira: https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David mentions that he's already integrated JTS into Solr 4.0 for querying on polygons stored as WKT. It's relatively easy to get WKT strings into Solr but does the Field type exist yet? Is there a patch or something that I can test out? Here's how I would do it using GDAL/OGR and the already existing csv update handler. http://www.gdal.org/ogr/drv_csv.html ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT This converts a shapefile to a csv with the geometries intact in the form of WKT. You can then get the data into Solr by running the following command. curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" There are lots of flavors of geometries so I suspect that this will be a daunting task but because JTS recognizes each geometry type it should be possible to work with them. Does anyone know of a patch or even when this functionality might be included in Solr 4.0?
I need to query for polygons ;-) Thanks, Adam -- Grant Ingersoll http://www.lucidimagination.com/
Re: [WKT] Spatial Searching
Thought I would share this on web mapping... it's a great write-up and something to consider when talking about working with spatial data. http://www.tokumine.com/2010/09/20/gis-data-payload-sizes/ Adam On Feb 9, 2011, at 7:03 AM, Grant Ingersoll wrote: The show stopper for JTS is its license, unfortunately. Otherwise, I think it would be done already! We could, since it's LGPL, make it an optional dependency, assuming someone can stub it out. On Feb 8, 2011, at 11:18 PM, Adam Estrada wrote: I just came across a ~nudge post over in the SIS list on what the status is for that project. This got me looking more into spatial mods with Solr 4.0. I found this enhancement in Jira: https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David mentions that he's already integrated JTS into Solr 4.0 for querying on polygons stored as WKT. It's relatively easy to get WKT strings into Solr but does the Field type exist yet? Is there a patch or something that I can test out? Here's how I would do it using GDAL/OGR and the already existing csv update handler. http://www.gdal.org/ogr/drv_csv.html ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT This converts a shapefile to a csv with the geometries intact in the form of WKT. You can then get the data into Solr by running the following command. curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" There are lots of flavors of geometries so I suspect that this will be a daunting task but because JTS recognizes each geometry type it should be possible to work with them. Does anyone know of a patch or even when this functionality might be included in Solr 4.0? I need to query for polygons ;-) Thanks, Adam -- Grant Ingersoll http://www.lucidimagination.com/
Re: Architecture decisions with Solr
I tried the multi-core route and it gets too complicated and cumbersome to maintain. That is just from my own personal testing... It was suggested that each user have their own ID in a single index that you can query against accordingly. In the example schema.xml I believe there is a field called textTight or something like that that is meant for SKU numbers. Give each user their own guid or md5 hash and add that as part of all your queries. That way, only their data are returned. It would be the equivalent of something like this... SELECT * FROM mytable WHERE userid = '3F2504E04F8911D39A0C0305E82C3301' AND ... Grant Ingersoll gave a presentation at the Lucene Revolution conference that demonstrated that you can build a query to be as easy or as complicated as any SQL statement. Maybe he can share that PPT? Adam On Feb 9, 2011, at 2:47 PM, Sujit Pal wrote: Another option (assuming the case where a user can be granted access to a certain class of documents, and more than one user would be able to access certain documents) would be to store the access filter (as an OR query of content types) in an external cache (perhaps a database or an external cache that the database changes are published to periodically), then using this access filter as a facet on the base query. -sujit On Wed, 2011-02-09 at 14:38 -0500, Glen Newton wrote: This application will be built to serve many users If this means that you have thousands of users, 1000s of VMs and/or 1000s of cores is not going to scale. Have an ID in the index for each user, and filter using it. Then they can see only their own documents. Assuming that you are building an app that through which they authenticate talks to solr . (i.e. all requests are filtered using their ID) -Glen On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com wrote: From what I understand about multicore, each of the indexes are independant from each other right? Or would one index have access to the info of the other?
My requirement is like you mention, a client has access only to his or her search data based in their documents. Other clients have no access to the index of other clients. Greg -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: February 9, 2011 14:28 To: solr-user@lucene.apache.org Subject: Re: Architecture decisions with Solr What about standing up a VM (search appliance that you would make) for each client? If there's no data sharing across clients, then using the same solr server/index doesn't seem necessary. Solr will easily meet your needs though, its the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and having one index file per client account. The reason for this is that security is achieved by having a separate index for each client etc.. Is this the best approach? How feasible is it (dynamically creating indexes on client account creation)? Is it better to go the faceted search capabilities route? Thanks for your help Greg
[WKT] Spatial Searching
I just came across a ~nudge post over in the SIS list on what the status is for that project. This got me looking more into spatial mods with Solr 4.0. I found this enhancement in Jira: https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David mentions that he's already integrated JTS into Solr 4.0 for querying on polygons stored as WKT. It's relatively easy to get WKT strings into Solr but does the Field type exist yet? Is there a patch or something that I can test out? Here's how I would do it using GDAL/OGR and the already existing csv update handler. http://www.gdal.org/ogr/drv_csv.html ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT This converts a shapefile to a csv with the geometries intact in the form of WKT. You can then get the data into Solr by running the following command. curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" There are lots of flavors of geometries so I suspect that this will be a daunting task but because JTS recognizes each geometry type it should be possible to work with them. Does anyone know of a patch or even when this functionality might be included in Solr 4.0? I need to query for polygons ;-) Thanks, Adam
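A side note for anyone copying that curl line: list archives tend to swallow the & separators between query parameters. Building the URL programmatically avoids the problem entirely. A sketch using only Python's standard library (parameter names and values are the ones from the post):

```python
from urllib.parse import urlencode

# urlencode adds the & separators and percent-escapes the commas,
# semicolons, and backslashes; Solr decodes them server-side.
params = {
    "commit": "true",
    "separator": ",",
    "fieldnames": "id,attr1,attr2,attr3,geom",
    "stream.file": r"C:\tmp\output.csv",
    "overwrite": "true",
    "stream.contentType": "text/plain;charset=utf-8",
}
url = "http://localhost:8983/solr/update/csv?" + urlencode(params)
print(url)
```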
Re: Time fields
If you're using a DIH you can configure it however you want. Here is a snippet of my code. Note the DateFormatTransformer.

<dataConfig>
  <dataSource type="JdbcDataSource" name="bleh" driver="net.sourceforge.jtds.jdbc.Driver"
      url="jdbc:jtds:sqlserver://localhost;databaseName=bleh;responseBuffering=adaptive;"
      user="test" password="test" onError="skip"/>
  <document>
    <entity name="Entities" dataSource="JIEE" transformer="DateFormatTransformer"
        query="SELECT EntityUID AS id, EntityType AS cat, EntityUIDParent AS pid, subject AS subject, summary AS summary, DateCreated AS eventdate, Latitude AS lat, Longitude AS lng, Type AS jtype, SupportCategory AS supcat, Cause AS cause, Status AS status, Urgency AS urgency, Priority AS priority, Coordinate AS coords FROM dbo.JIEESearchIndex">
      <field column="id" name="id"/>
      <field column="cat" name="cat"/>
      <field column="subject" name="subject"/>
      <field column="summary" name="summary"/>
      <field column="eventdate" name="eventdate" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"/>
      <field column="lat" name="lat"/>
      <field column="lng" name="lng"/>
      <field column="coords" name="coords"/>
      <field column="jtype" name="jtype"/>
      <field column="supcat" name="supcat"/>
      <field column="cause" name="cause"/>
      <field column="status" name="status"/>
      <field column="urgency" name="urgency"/>
    </entity>
  </document>
</dataConfig>

On Wed, Feb 2, 2011 at 7:28 PM, Dennis Gearon gear...@sbcglobal.net wrote: For time of day fields, NOT unix timestamp/dates, what is the best way to do that? I can think of seconds since beginning of day as integer OR string Any other ideas? Assume that I'll be using range queries. TIA. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
[Failure] to start Solr 4.0
All, I've checked out the latest code and built the root directory with ant compile and then built the solr directory again using the ant dist command, which gives me the lucene-libs directory and a couple others. Now Solr won't start. What am I missing??? This is as far as it gets:

mini:example Adam$ java -jar start.jar
2011-01-28 17:14:23.402:INFO::Logging to STDERR via org.mortbay.log.StdErrLog
2011-01-28 17:14:23.605:INFO::jetty-6.1.26
2011-01-28 17:14:23.638:INFO::Started SocketConnector@0.0.0.0:8983

What could possibly be the problem? Adam
Re: [Failure] to start Solr 4.0
I found the problem... You HAVE to build the Solr directory using ant example in order for the web application to start properly. Sorry to post so many times. Adam On Jan 28, 2011, at 5:20 PM, Adam Estrada wrote: All, I've checked out the latest code and built the root directory with ant compile and then built the solr directory again using the ant dist command, which gives me the lucene-libs directory and a couple others. Now Solr won't start. What am I missing??? This is as far as it gets. mini:example Adam$ java -jar start.jar 2011-01-28 17:14:23.402:INFO::Logging to STDERR via org.mortbay.log.StdErrLog 2011-01-28 17:14:23.605:INFO::jetty-6.1.26 2011-01-28 17:14:23.638:INFO::Started SocketConnector@0.0.0.0:8983 What could possibly be the problem? Adam
Re: Tika config in ExtractingRequestHandler
I believe that as along as Tika is included in a folder that is referenced by solrconfig.xml you should be good. Solr will automatically throw mime types to Tika for parsing. Can anyone else add to this? Thanks, Adam On Thu, Jan 27, 2011 at 5:06 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: The wiki page for the ExtractingRequestHandler says that I can add the following configuration: str name=tika.config/my/path/to/tika.config/str I have tried to google for an example of such a Tika config file, but haven't found anything. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: DIH From various File system locations
There are a few tutorials out there. 1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical) 2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.) 3. Build the latest from branch http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read this one: http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/ but add the solr parameter at the end: bin/nutch crawl urls -depth 5 -topN 100 -solr http://localhost:8983/solr This will automatically add the data Nutch collected to Solr. For larger files I would also increase your JAVA_OPTS env to something like JAVA_OPTS='-Xmx2048m' Adam On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt panbh...@gmail.com wrote: Thanks Adam, It seems like Nutch would solve most of my concerns. It would be great if you can share resources for Nutch with us. / Pankaj Bhatt. On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups estrada.adam.gro...@gmail.com wrote: I would just use Nutch and specify the -solr param on the command line. That will add the extracted content to your instance of solr. Adam Sent from my iPhone On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote: Hi All, I need to index the documents present in my file system at various locations (e.g. C:\docs, D:\docs). Is there any way through which I can specify this in my DIH configuration?
Here is my configuration:

<document>
  <entity name="sd" processor="FileListEntityProcessor"
          fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
          baseDir="G:\\Desktop\\" recursive="false" rootEntity="true"
          transformer="DateFormatTransformer" onerror="continue">
    <entity name="tikatest" processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
            url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
      <field column="Author" name="author" meta="true"/>
      <field column="Content-Type" name="title" meta="true"/>
      <!-- <field column="title" name="title" meta="true"/> -->
      <field column="text" name="all_text"/>
    </entity>
    <!-- <field column="fileLastModified" name="date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/> -->
    <field column="fileSize" name="size"/>
    <field column="file" name="filename"/>
  </entity>
  <!-- baseDir="../site" -->
</document>

/ Pankaj Bhatt.
Re: DIH From various File system locations
I take that back... I am currently using version 1.2; make sure that the latest versions of Tika and PDFBox are in the contrib folder. 1.3 is structured a bit differently and it doesn't look like there is a contrib directory. Maybe one of the Nutch contributors can comment on this? Adam On Tue, Jan 25, 2011 at 3:21 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: There are a few tutorials out there. 1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical) 2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.) 3. Build the latest from branch http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read this one: http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/ but add the solr parameter at the end: bin/nutch crawl urls -depth 5 -topN 100 -solr http://localhost:8983/solr This will automatically add the data Nutch collected to Solr. For larger files I would also increase your JAVA_OPTS env to something like JAVA_OPTS='-Xmx2048m' Adam On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt panbh...@gmail.com wrote: Thanks Adam, It seems like Nutch would solve most of my concerns. It would be great if you can share resources for Nutch with us. / Pankaj Bhatt. On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups estrada.adam.gro...@gmail.com wrote: I would just use Nutch and specify the -solr param on the command line. That will add the extracted content to your instance of solr. Adam Sent from my iPhone On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote: Hi All, I need to index the documents present in my file system at various locations (e.g. C:\docs, D:\docs). Is there any way through which I can specify this in my DIH configuration?
Here is my configuration:- document entity name=sd processor=FileListEntityProcessor fileName=docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$ *baseDir=G:\\Desktop\\* recursive=false rootEntity=true transformer=DateFormatTransformer onerror=continue entity name=tikatest processor=org.apache.solr.handler.dataimport.TikaEntityProcessor url=${sd.fileAbsolutePath} format=text dataSource=bin field column=Author name=author meta=true/ field column=Content-Type name=title meta=true/ !-- field column=title name=title meta=true/ -- field column=text name=all_text/ /entity !-- field column=fileLastModified name=date dateTimeFormat=-MM-dd'T'hh:mm:ss / -- field column=fileSize name=size/ field column=file name=filename/ /entity !--baseDir=../site-- /document / Pankaj Bhatt.
Re: Indexing spatial columns
Hi MapButcher, There are a couple things going on here. 1. The spatial functionality is confusing between versions of Solr. I wish someone would update the Solr SpatialSearch wiki page. 2. You will want to use the jTDS driver here instead of the one from Microsoft. http://jtds.sourceforge.net/ It works a little better. 3. For Solr 4.0 you will basically have to concatenate the lat/long fields into a single column, which in the example schema is called store. 4. I don't know if individual columns actually exist for latitude and longitude in 4.0, but in 1.4.x I know the lat/long fields HAD to be called lat and lng and had to be tdouble type, which I see below. 5. Revert back to Solr 1.4.x and try using their plugin http://www.jteam.nl/news/spatialsolr.html 6. Try your queries in the Solr admin tool first before trying to integrate this into your code. Overall, I have had great success with Solr spatial in just doing a simple radius search. I am using the core 4.0 functionality and am having no problems. I will eventually get into distance and bounding box queries, so whatever you figure out and share would be great! Good luck, Adam On Jan 24, 2011, at 4:46 AM, mapbutcher wrote: Hi, I'm a bit of a solr beginner. I have installed Solr 4.0 and I'm trying to index some spatial data stored in a sql server instance.
I'm using the DataImportHandler here is my data-comfig.xml: dataConfig dataSource type=JdbcDataSource driver=com.microsoft.sqlserver.jdbc.SQLServerDriver url=jdbc:sqlserver://localhost\sqlserver08;databaseName=Spatial user=sa password=sqlserver08/ document entity name=poi query=select OBJECTID,CATEGORY,NAME,POINT_X,POINT_Y from NZ_POI field column=OBJECTID name=id/ field column=CATEGORY name=category/ field column=NAME name=name/ field column=POINT_X name=lat/ field column=POINT_Y name=lon/ /entity /document /dataConfig In my schema file I have following definition: field name=category type=string indexed=true stored=true/ field name=name type=string indexed=true stored=true/ field name=lat type=tdouble indexed=true stored=true/ field name=lon type=tdouble indexed=true stored=true/ copyField source=category dest=text/ copyField source=name dest=text/ I have completed a data import with no errors in the log as far as i can tell. However when i inspect the schema i do not see the columns names lat\lon. When sending the query: http://localhost:8080/Solr/select/?q=Camp AND _val_:recip(dist(2, lon, lat, 44.794, -93.2696), 1, 1, 0)^100 I get an error undefined column. Does anybody have any ideas about whether the above is the correct procedure for indexing spatial data? Cheers S -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-spatial-columns-tp2318493p2318493.html Sent from the Solr - User mailing list archive at Nabble.com.
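Point 3 in the reply above (concatenating the lat/long fields into a single column) is just string formatting before indexing; a minimal sketch, assuming decimal-degree inputs coming out of the database:

```python
def to_latlon(lat: float, lon: float) -> str:
    """Format a coordinate pair as the 'lat,lon' string a Solr
    location-style field expects. Raises on out-of-range values."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinate out of range: {lat}, {lon}")
    return f"{lat},{lon}"

# Using the point from the query in the message above
print(to_latlon(44.794, -93.2696))
```

Whether you do this in SQL (string concatenation in the DIH query) or in client code, the key is that the combined field, not the separate lat/lon columns, is what the spatial query functions operate on.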
[Building] Solr4.0 on Windows
All, I am having problems building Solr trunk on my Windows 7 machine. I get the following errors...

BUILD FAILED
C:\Apache\Solr-Nightly\build.xml:23: The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:529: The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!

I am full admin on my machine and made sure that I was running the build as admin but it still fails. I just tried the same thing on the Mac and ran it as sudo and it built perfectly. Any ideas? Thanks, Adam
Re: Indexing FTP Documents through SOLR??
+1 on Nutch! On Fri, Jan 21, 2011 at 4:11 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Please take a look at Apache Nutch. I can crawl through a file system over FTP. After crawling, it can use Tika to extract the content from your PDF files and other. Finally you can then send the data to your Solr server for indexing. http://nutch.apache.org/ Hi All, Is there is any way in SOLR or any plug-in through which the folders and documents in FTP location can be indexed. / Pankaj Bhatt.
Re: [Building] Solr4.0 on Windows
So I did manage to get this to build... ant compile does it. Didn't it used to use straight Maven? It's pretty hard to keep track of what's what... Anyway, is there any way/reason all the cool Lucene jars aren't getting copied into $SOLR_HOME/lib? That would really help and save a lot of time. Where in the build script would I need to change this? Thanks, Adam On Jan 23, 2011, at 9:31 PM, Adam Estrada wrote: All, I am having problems building Solr trunk on my Windows 7 machine. I get the following errors...

BUILD FAILED
C:\Apache\Solr-Nightly\build.xml:23: The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:529: The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!

I am full admin on my machine and made sure that I was running the build as admin but it still fails. I just tried the same thing on the Mac and ran it as sudo and it built perfectly. Any ideas? Thanks, Adam
Re: Solr Out of Memory Error
Is anyone familiar with the environment variable, JAVA_OPTS? I set mine to a much larger heap size and never had any of these issues again. JAVA_OPTS = -server -Xms4048m -Xmx4048m Adam On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, By adding more servers do u mean sharding of index.And after sharding , how my query performance will be affected . Will the query execution time increase. Thanks, Isan Fulia. On 19 January 2011 12:52, Grijesh pintu.grij...@gmail.com wrote: Hi Isan, It seems your index size 25GB si much more compared to you have total Ram size is 4GB. You have to do 2 things to avoid Out Of Memory Problem. 1-Buy more Ram ,add at least 12 GB of more ram. 2-Increase the Memory allocated to solr by setting XMX values.at least 12 GB allocate to solr. But if your all index will fit into the Cache memory it will give you the better result. Also add more servers to load balance as your QPS is high. Your 7 Laks data makes 25 GB of index its looking quite high.Try to lower the index size What are you indexing in your 25GB of index? - Thanx: Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p2285779.html Sent from the Solr - User mailing list archive at Nabble.com. -- Thanks Regards, Isan Fulia.
Re: boilerpipe solr tika howto please
Is there a drastic difference between this and TagSoup which is already included in Solr? On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat arnaud.gaudi...@gmail.comwrote: Hello, I would like to use BoilerPipe (a very good program which cleans the html content from surplus clutter). I saw that BoilerPipe is inside Tika 0.8 and so should be accessible from solr, am I right? How I can Activate BoilerPipe in Solr? Do I need to change solrconfig.xml ( with org.apache.solr.handler.extraction.ExtractingRequestHandler)? Or do I need to modify some code inside Solr? I so something like TikaCLI -F in the tika forum ( http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration) is it the right way? Thanks in advance, Arno.
Re: Multi-word exact keyword case-insensitive search suggestions
Hi, the following seems to work pretty well. fieldType name=text_ws class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.KeywordTokenizerFactory / filter class=solr.ShingleFilterFactory maxShingleSize=4 outputUnigrams=true outputUnigramIfNoNgram=false / /analyzer /fieldType !-- A text field that uses WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so that a query of wifi or wi fi could match a document containing Wi-Fi. Synonyms and stopwords are customized by external files, and stemming is enabled. The attribute autoGeneratePhraseQueries=true (the default) causes words that get split to form phrase queries. For example, WordDelimiterFilter splitting text:pdp-11 will cause the parser to generate text:pdp 11 rather than (text:PDP OR text:11). NOTE: autoGeneratePhraseQueries=true tends to not work well for non whitespace delimited languages. -- fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ !-- in this example, we will only use synonyms at query time filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ -- !-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. 
-- filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer /fieldType copyField source=cat dest=text/ copyField source=subject dest=text/ copyField source=summary dest=text/ copyField source=cause dest=text/ copyField source=status dest=text/ copyField source=urgency dest=text/ I ingest the source fields as text_ws (I know I've changed it a bit) and then copy the field to text. This seems to do what you are asking for. Adam On Thu, Jan 13, 2011 at 12:05 AM, Chamnap Chhorn chamnapchh...@gmail.comwrote: Hi all, I'm just stuck with exact keyword for several days. Hope you guys could help me. Here is the scenario: 1. It need to be matched with multi-word keyword and case insensitive 2. Partial word or single word matching with this field is not allowed I want to know the field type definition for this field and sample solr query. I need to combine this search with my full text search which uses dismax query. Thanks -- Chhorn Chamnap http://chamnapchhorn.blogspot.com/
[sfield] Missing in Spatial Search
According to the documentation here: http://wiki.apache.org/solr/SpatialSearch the field that identifies the spatial point data is sfield. See the console output below.

Jan 13, 2011 6:49:40 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={spellcheck=true&f.jtype.facet.mincount=1&facet=true&f.cat.facet.mincount=1&f.cause.facet.mincount=1&f.urgency.facet.mincount=1&rows=10&start=0&q=*:*&f.status.facet.mincount=1&facet.field=cat&facet.field=jtype&facet.field=status&facet.field=cause&facet.field=urgency&?=fq={!type%3Dgeofilt+pt%3D39.0914154052734,-84.517822265625+sfield%3Dcoords+d%3D300}text:} hits=113 status=0 QTime=1
Jan 13, 2011 6:51:51 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: missing sfield for spatial request

Any ideas on this one? Thanks in advance, Adam
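The stray `?=fq` in the logged request suggests the filter query was mangled client-side before it ever reached Solr, which would explain the missing-sfield error even though sfield appears in the local params. A minimal sketch of building the filter query so the local params survive URL encoding (the field name and point are the ones from the log above, not a recommendation):

```python
from urllib.parse import urlencode

pt = "39.0914154052734,-84.517822265625"  # point from the logged request
sfield, d = "coords", 300                  # spatial field and distance in km

# Build the {!geofilt ...} local-params string, then let urlencode
# percent-escape the braces, spaces, and '=' signs in one place.
fq = f"{{!geofilt pt={pt} sfield={sfield} d={d}}}"
query = urlencode({"q": "*:*", "fq": fq})
print(query)
```

Hand-assembling the query string is where the `?=` style corruption tends to creep in; encoding the whole fq value at once sidesteps it.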
Re: Solr 4.0 = Spatial Search - How to
I believe this is what you are looking for. I renamed the field called store to coords in the schema.xml file. The tricky part is building out the query. I am using SolrNet to do this though and have not yet cracked the problem. http://localhost:8983/solr/select?q=*:*+AND+eventdate:[2006-01-21T00:00:00Z+TO+2007-01-21T00:00:00Z]&fq={!bbox}&sfield=coords&pt=32.15,-93.85&d=500 Adam On Wed, Jan 12, 2011 at 8:01 PM, caman aboxfortheotherst...@gmail.com wrote: Ok, this could be very easy to do but was not able to do this. Need to enable location search i.e. if someone searches for location 'New York' = show results for New York and results within 50 miles of New York. We do have latitude/longitude stored in database for each record but not sure how to index these values to enable spatial search. Any help would be much appreciated. thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245592.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.0 = Spatial Search - How to
Actually, by looking at the results from the geofilt filter it would appear that it's not giving me the results I'm looking for. Or maybe it is... I need to convert my results to KML to see if it is actually performing a proper radius query. http://localhost:8983/solr/select?q=*:*&fq={!geofilt%20pt=39.0914154052734,-84.517822265625%20sfield=coords%20d=5000} http://localhost:8983/solr/select?q=*:*+AND+eventdate:[2006-01-21T00:00:00Z+TO+2007-01-21T00:00:00Z]&fq={!geofilt%20pt=32.15,-93.85%20sfield=coords%20d=5000} Please let me know what you find. Adam On Wed, Jan 12, 2011 at 8:24 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: I believe this is what you are looking for. I renamed the field called store to coords in the schema.xml file. The tricky part is building out the query. I am using SolrNet to do this though and have not yet cracked the problem. http://localhost:8983/solr/select?q=*:*+AND+eventdate:[2006-01-21T00:00:00Z+TO+2007-01-21T00:00:00Z]&fq={!bbox}&sfield=coords&pt=32.15,-93.85&d=500 Adam On Wed, Jan 12, 2011 at 8:01 PM, caman aboxfortheotherst...@gmail.com wrote: Ok, this could be very easy to do but was not able to do this. Need to enable location search i.e. if someone searches for location 'New York' = show results for New York and results within 50 miles of New York. We do have latitude/longitude stored in database for each record but not sure how to index these values to enable spatial search. Any help would be much appreciated. thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245592.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.0 = Spatial Search - How to
In my case, I am getting data from a database and am able to concatenate the lat/long as a coordinate pair to store in my coords field. To test this, I randomized the lat/long values and generated about 6000 documents. Adam On Wed, Jan 12, 2011 at 8:29 PM, caman aboxfortheotherst...@gmail.comwrote: Adam, thanks. Yes that helps but how does coords fields get populated? All I have is field name=lat type=tdouble indexed=true stored=true / field name=lng type=tdouble indexed=true stored=true / field name=coord type=location indexed=true stored=true / fields 'lat' and 'lng' get populated by dataimporthandler but coord, am not sure? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245709.html Sent from the Solr - User mailing list archive at Nabble.com.
[Example] Compound Queries
All, I have the following query which works just fine for querying a date range. Now I would like to add any kind of spatial query to the mix. Would someone be so kind as to help me out with an example spatial query that works in conjunction with my date range query? http://localhost:8983/solr/select/?q=hurricane+AND+eventdate:[2006-01-21T00:00:00Z+TO+2007-01-21T00:00:00Z]&version=2.2&start=0&rows=10&indent=on I think it's something like this, but my results are not correct: http://localhost:8983/solr/select/?q=hurricane+AND+eventdate:[2006-01-21T00:00:00Z+TO+2007-01-21T00:00:00Z]&sfield=store&pt=45.15,-93.85&sort=geodist()%20asc&version=2.2&start=0&rows=10&indent=on Your feedback is greatly appreciated! Adam
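One way to sketch the compound query above: keep the date range in `q` and pass the spatial part as `sfield`/`pt` plus a geodist sort, letting a URL library handle the separators and escaping. This is a sketch under the example's own assumptions (the `store` field and the sample point come from the message; the date format is written in Solr's `00:00:00Z` form):

```python
from urllib.parse import urlencode

params = {
    "q": "hurricane AND eventdate:[2006-01-21T00:00:00Z TO 2007-01-21T00:00:00Z]",
    "sfield": "store",          # spatial point field from the example schema
    "pt": "45.15,-93.85",       # reference point for geodist()
    "sort": "geodist() asc",    # nearest results first
    "start": "0",
    "rows": "10",
}
url = "http://localhost:8983/solr/select/?" + urlencode(params)
print(url)
```

Keeping each piece as a separate parameter also makes it easy to swap the sort for an `fq={!geofilt}` filter later without rebuilding the whole string.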
Re: DIH - Closing ResultSet in JdbcDataSource
This is my configuration, which seems to work just fine.

<?xml version="1.0" encoding="utf-8" ?>
<dataConfig>
  <dataSource type="JdbcDataSource" name="DBImport" driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://localhost;databaseName=50_DEV;responseBuffering=adaptive;"
              user="test" password="test" onError="skip"/>
  <document>

From there it's just a matter of running the select statement and mapping it against the correct fields in your index. Adam On Fri, Jan 7, 2011 at 2:40 PM, Shane Perry thry...@gmail.com wrote: Hi, I am in the process of migrating our system from Postgres 8.4 to Solr 1.4.1. Our system is fairly complex and as a result, I have had to define 19 base entities in the data-config.xml definition file. Each of these entities executes 5 queries. When doing a full-import, as each entity completes, the server hosting Postgres shows 5 idle in transaction for the entity. In digging through the code, I found that the JdbcDataSource wraps the ResultSet object in a custom ResultSetIterator object, leaving the ResultSet open. Walking through the code I can't find a close() call anywhere on the ResultSet. I believe this results in the idle in transaction processes. Am I off base here? I'm not sure what the overall implications are of the idle in transaction processes, but is there a way I can get around the issue without importing each entity manually? Any feedback would be greatly appreciated. Thanks in advance, Shane
Re: [sqljdbc4.jar] Errors
I can't tell any difference in performance but it does work like a charm. At least the messaging in the console is a lot more verbose. Thank you very much for the heads up on this one ;-) Adam On Wed, Jan 5, 2011 at 4:29 AM, Gora Mohanty g...@mimirtech.com wrote: On Wed, Jan 5, 2011 at 10:18 AM, Estrada Groups estrada.adam.gro...@gmail.com wrote: I downloaded that driver today and will test it tomorrow. Thanks for the tip! Would you mind sending an XML code snippet if it's any different to load than the MS driver? [...] I presume that you are referring to the jTDS driver. The options are slightly different. Here is a snippet from the XML configuration of our DataImportHandler, with sensitive details obscured. dataSource type=JdbcDataSource name=jdbc driver=net.sourceforge.jtds.jdbc.Driver url=jdbc:jtds:sqlserver://db_server:port;databasename=dbname;responseBuffering=adaptive user=user password=password onError=skip / The jtds FAQ ( http://jtds.sourceforge.net/faq.html ) also has other configuration options, and more helpful information. For us, the transition was pretty painless. Regards, Gora
[sqljdbc4.jar] Errors
Can anyone help me with the following error. I upgraded my database to SQL Server 2008 SP2 and now I get the following error. It was working with SQL Server 2005. Solr Error Stack: Caused by: java.lang.UnsupportedOperationException: Java Runtime Environment (JRE) version 1.6 is not supported by this driver. Use the sqljdbc4.jar class library, which provides support for JDBC 4.0. Any tips on this would be great! Thanks, Adam
Re: [sqljdbc4.jar] Errors
I got the latest jar file from the MS website and then changed the authentication to Mixed Mode on my DB. That seems to have fixed it. My 2005 Server was Windows Authentication only and that worked so there are obviously quite a few differences between the versions of the DB. I learn something new every day Thanks for the feedback! Adam On Tue, Jan 4, 2011 at 10:20 PM, Lance Norskog goks...@gmail.com wrote: Do you get a new JDBC driver jar with 2008? Look around the distribution or the MS web site. On Tue, Jan 4, 2011 at 7:06 PM, pankaj bhatt panbh...@gmail.com wrote: Hi Adam, Can you try by downgrading your Java version to java 5. However i am using Java 6u13 with sqljdbc4.jar , i however do not get any error. If possible, can you pleease also try with some other version of Java 6. / Pankaj Bhatt. On Wed, Jan 5, 2011 at 5:01 AM, Adam Estrada estrada.adam.gro...@gmail.comwrote: Can anyone help me with the following error. I upgraded my database to SQL Server 2008 SP2 and now I get the following error. It was working with SQL Server 2005. Solr Error Stack Caused by: java.lang.UnsupportedOperationException: Java Runtime Environment (JR E) version 1.6 is not supported by this driver. Use the sqljdbc4.jar class libra ry, which provides support for JDBC 4.0. Any tips on this would be great! Thanks, Adam -- Lance Norskog goks...@gmail.com
Re: [Nutch] and Solr integration
All, I realize that the documentation says that you crawl first then add to Solr, but I spent several hours running the same command through Cygwin with -solrindex http://localhost:8983/solr on the command line (eg. bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr) and it worked. Does anyone know why it's not working for me anymore? I am using the Lucid build of Solr, which was what I was using before. I neglected to write down the command line syntax, which is biting me in the arse. Any tips on this one would be great! Thanks, Adam On Mon, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote: Why are you using solrindex in the argument? It is used when we need to index the crawled data in Solr. For more read http://wiki.apache.org/nutch/NutchTutorial . Also for nutch-solr integration this is very useful blog http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ I integrated nutch and solr and it works well. Thanks On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene] ml-node+2122347-622655030-146...@n3.nabble.com wrote: All, I have a couple websites that I need to crawl and the following command line used to work I think. Solr is up and running and everything is fine there and I can go through and index the site but I really need the results added to Solr after the crawl. Does anyone have any idea on how to make that happen or what I'm doing wrong? These errors are being thrown from Hadoop which I am not using at all.
$ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr
crawl started in: crawl
rootUrlDir = http://localhost:8983/solr
threads = 10
depth = 100
indexer=lucene
topN = 50
Injector: starting at 2010-12-20 15:23:25
Injector: crawlDb: crawl/crawldb
Injector: urlDir: http://localhost:8983/solr
Injector: Converting injected urls to crawl db entries.
Exception in thread main java.io.IOException: No FileSystem for scheme: http
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:169)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

-- View message @ http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122347.html
-- Kumar Anurag

-- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122623.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SpatialTierQueryParserPlugin Loading Error
Not just yet, Grant... I have been sidetracked on a couple of other things, but I will keep you posted. Thanks for the response, Adam

On Mon, Jan 3, 2011 at 10:22 AM, Grant Ingersoll gsing...@apache.org wrote: Sorry, I just saw this, Adam. Were you able to get it working?

On Dec 28, 2010, at 8:54 PM, Adam Estrada wrote: Hi Grant, I grabbed the latest version from trunk this morning and am still unable to get any of the spatial functionality to work. I still seem to be getting the class loading errors that I was getting when using the patches and jar files I found all over the web. What I really need at this point is an example of solrconfig.xml and whatever else I need to include to make it work properly. I am using the Geonames DB with valid lat/longs in decimal degrees, so I'm confident that the data are correct. I have tried several examples, all with the same results. There are other patches, like the following, that show snippets of how to modify the solrconfig file, but there is no definitive source... https://issues.apache.org/jira/secure/attachment/12452781/SOLR-2077.Quach.Mattmann.082210.patch.txt I would gladly update this page if I could just get it working: http://wiki.apache.org/solr/SpatialSearch w/r, Adam

On Tue, Dec 14, 2010 at 9:04 AM, Grant Ingersoll gsing...@apache.org wrote: For this functionality, you are probably better off using trunk or branch_3x. There are quite a few patches related to that particular one that you will need to apply in order to have it work correctly.

On Dec 13, 2010, at 10:06 PM, Adam Estrada wrote: All, Can anyone shed some light on this error? I can't seem to get this class to load. I am using the distribution of Solr from Lucid Imagination and the spatial plugin from https://issues.apache.org/jira/browse/SOLR-773. I don't know how to apply a patch, but the jar file is in there. What else can I do?
org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin'
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
    at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525)
    at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1442)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:548)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
    at org.mortbay.jetty.Server.doStart(Server.java:210)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.mortbay.start.Main.invokeMain(Main.java:183)
    at org.mortbay.start.Main.start(Main.java:497)
    at org.mortbay.start.Main.main(Main.java:115)
Caused by: java.lang.ClassNotFoundException: org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
Re: [Nutch] and Solr integration
BLEH! *facepalm* This is entirely possible to do in a single step AS LONG AS YOU GET THE SYNTAX CORRECT ;-) http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solr http://localhost:8983/solr

The correct param is -solr, NOT -solrindex. Cheers, Adam

On Mon, Jan 3, 2011 at 11:45 AM, Adam Estrada estrada.a...@gmail.com wrote: All, I realize that the documentation says that you crawl first and then add to Solr, but I spent several hours running the same command through Cygwin with -solrindex http://localhost:8983/solr on the command line (e.g. bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr) and it worked. Does anyone know why it's not working for me anymore? I am using the Lucid build of Solr, which is what I was using before. I neglected to write down the command line syntax, which is biting me in the arse. Any tips on this one would be great! Thanks, Adam

On Mon, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote: Why are you using solrindex in the argument? It is used when we need to index the crawled data in Solr. For more, read http://wiki.apache.org/nutch/NutchTutorial . Also, for Nutch-Solr integration this is a very useful blog: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ I integrated Nutch and Solr and it works well. Thanks

On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene] ml-node+2122347-622655030-146...@n3.nabble.com wrote: All, I have a couple websites that I need to crawl, and the following command line used to work, I think.
Solr is up and running and everything is fine there, and I can go through and index the site, but I really need the results added to Solr after the crawl. Does anyone have any idea how to make that happen or what I'm doing wrong? These errors are being thrown from Hadoop, which I am not using at all.

$ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr
crawl started in: crawl
rootUrlDir = http://localhost:8983/solr
threads = 10
depth = 100
indexer=lucene
topN = 50
Injector: starting at 2010-12-20 15:23:25
Injector: crawlDb: crawl/crawldb
Injector: urlDir: http://localhost:8983/solr
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: No FileSystem for scheme: http
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:169)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

-- View message @ http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122347.html
-- Kumar Anurag
[DIH] and XML Namespaces
All, I am indexing some RSS feeds that are bound to specific namespaces. See below...

<dataConfig>
  <dataSource type="HttpDataSource" encoding="UTF-8" connectionTimeout="50" readTimeout="50"/>
  <document>
    <entity name="filedatasource" processor="FileListEntityProcessor"
            baseDir="C:/Apache/Solr-Nightly/solr/example/solr/conf/dataimporthandler"
            fileName="^.*xml$" recursive="true" rootEntity="false" dataSource="null">
      <entity name="CBP" pk="link" datasource="filedatasource"
              url="http://ws.geonames.org/rssToGeoRSS?geoRSS=simple&amp;feedUrl=http://www.cbp.gov/xp/cgov/admin/rss/?rssUrl=/home.xml"
              processor="XPathEntityProcessor"
              forEach="/rss/channel | /rss/channel/item"
              transformer="DateFormatTransformer,HTMLStripTransformer">
        <field column="source" xpath="/rss/channel/title" commonField="true"/>
        <field column="source-link" xpath="/rss/channel/link" commonField="true"/>
        <field column="subject" xpath="/rss/channel/description" commonField="true"/>
        <field column="title" xpath="/rss/channel/item/title"/>
        <field column="link" xpath="/rss/channel/item/link"/>
        <field column="description" xpath="/rss/channel/item/description" stripHTML="true"/>
        <field column="creator" xpath="/rss/channel/item/dc:creator"/>
        <field column="item-subject" xpath="/rss/channel/item/subject"/>
        <field column="author" xpath="/rss/channel/item/author"/>
        <field column="comments" xpath="/rss/channel/item/comments"/>
        <field column="pubdate" xpath="/rss/channel/item/pubDate" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'"/>
        <field column="dcdate" xpath="/rss/channel/item/dc:date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'"/>
        <field column="store" xpath="/rss/channel/item/georss:point"/>
      </entity>
    </entity>
  </document>
</dataConfig>

The process completely skips over any path with a colon in it, i.e. /rss/channel/item/georss:point. Any ideas how to get around this using the DIH? Thanks to Chris Mattmann for the heads-up on the geocoding services. Adam
Re: [DIH] and XML Namespaces
Piece of cake! http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example

Our XPath support has its limitations (no wildcards, only full path etc.), but we have tried to make sure that common use-cases are covered, and since it's based on a streaming parser, it is extremely fast and consumes a constant amount of memory even for large XMLs. It does not support namespaces, but it can handle XMLs with namespaces. When you provide the xpath, just drop the namespace and give the rest (e.g. if the tag is 'dc:subject' the mapping should just contain 'subject'). Easy, isn't it? And you didn't need to write one line of code! Enjoy :)

On Wed, Dec 29, 2010 at 12:05 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: All, I am indexing some RSS feeds that are bound to specific namespaces. [...] The process completely skips over any path with a colon in it, i.e. /rss/channel/item/georss:point. Any ideas how to get around this using the DIH? Thanks to Chris Mattmann for the heads-up on the geocoding services. Adam
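Applied to the config quoted in this thread, that advice means writing the namespace-prefixed xpaths without their prefixes. A sketch under that assumption (column names taken from the original config):

```xml
<!-- Namespace prefixes are dropped in XPathEntityProcessor mappings:
     dc:creator is addressed as /rss/channel/item/creator,
     dc:date as /rss/channel/item/date,
     georss:point as /rss/channel/item/point. -->
<field column="creator" xpath="/rss/channel/item/creator"/>
<field column="dcdate" xpath="/rss/channel/item/date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'"/>
<field column="store" xpath="/rss/channel/item/point"/>
```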
Re: SpatialTierQueryParserPlugin Loading Error
Hi Grant, I grabbed the latest version from trunk this morning and am still unable to get any of the spatial functionality to work. I still seem to be getting the class loading errors that I was getting when using the patches and jar files I found all over the web. What I really need at this point is an example of solrconfig.xml and whatever else I need to include to make it work properly. I am using the Geonames DB with valid lat/longs in decimal degrees, so I'm confident that the data are correct. I have tried several examples, all with the same results. There are other patches, like the following, that show snippets of how to modify the solrconfig file, but there is no definitive source... https://issues.apache.org/jira/secure/attachment/12452781/SOLR-2077.Quach.Mattmann.082210.patch.txt I would gladly update this page if I could just get it working: http://wiki.apache.org/solr/SpatialSearch w/r, Adam

On Tue, Dec 14, 2010 at 9:04 AM, Grant Ingersoll gsing...@apache.org wrote: For this functionality, you are probably better off using trunk or branch_3x. There are quite a few patches related to that particular one that you will need to apply in order to have it work correctly.

On Dec 13, 2010, at 10:06 PM, Adam Estrada wrote: All, Can anyone shed some light on this error? I can't seem to get this class to load. I am using the distribution of Solr from Lucid Imagination and the spatial plugin from https://issues.apache.org/jira/browse/SOLR-773. I don't know how to apply a patch, but the jar file is in there. What else can I do?
org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin'
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
    at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525)
    at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1442)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:548)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
    at org.mortbay.jetty.Server.doStart(Server.java:210)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.mortbay.start.Main.invokeMain(Main.java:183)
    at org.mortbay.start.Main.start(Main.java:497)
    at org.mortbay.start.Main.main(Main.java:115)
Caused by: java.lang.ClassNotFoundException: org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
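For anyone hitting the same ClassNotFoundException: it generally means the plugin jar is not on Solr's classpath, or the class name in solrconfig.xml does not match what is inside the jar. A hypothetical registration sketch — the lib path, jar regex, and parser name here are assumptions for illustration, not taken from this thread or the SOLR-773 patch:

```xml
<config>
  <!-- Make sure the plugin jar is actually loaded; dir and regex are examples -->
  <lib dir="./lib" regex="solr-spatial-.*\.jar"/>

  <!-- Register the query parser plugin class that failed to load above -->
  <queryParser name="spatialTier"
               class="org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin"/>
</config>
```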
Re: [Import Timeout] using /dataimport
All, That link is great, but I am still getting timeout issues, which cause the entire import to fail. The feeds that are failing are ones like Newsweek and USA Today, which are very widely used. It's strange, because sometimes they work and sometimes they don't. I think there are still timeout issues, and adding the params suggested in that article doesn't seem to fix it. Adam

On Tue, Dec 21, 2010 at 8:04 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: (10/12/22 9:35), Adam Estrada wrote: All, I've noticed that there are some RSS feeds that are slow to respond, especially during high-usage times throughout the day. Is there a way to set the timeout to something really high, or have it just wait until the feed is returned? The entire thing stops working when the feed doesn't respond. Your ideas are greatly appreciated. Adam

readTimeout? http://wiki.apache.org/solr/DataImportHandler#Configuration_of_URLDataSource_or_HttpDataSource Koji -- http://www.rondhuit.com/en/
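One thing worth checking in the config posted earlier in this archive: the dataSource there uses connectionTimeout="50" readTimeout="50". If those values are interpreted as milliseconds (which is how such timeouts are commonly specified), 50 would abort almost immediately on a slow feed. A sketch with more generous values — the exact numbers are illustrative assumptions, not a recommendation from the thread:

```xml
<!-- Assumed to be milliseconds: allow 5s to connect, 30s to read a slow feed -->
<dataSource type="HttpDataSource" encoding="UTF-8"
            connectionTimeout="5000"
            readTimeout="30000"/>
```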
Re: [Reload-Config] not working
I also noticed that when I run the reload-config command, the following warning is thrown. I changed all my pk="id" entries to see if that changed anything. Anyone have any ideas why this is not working for me?

INFO: id is a required field in SolrSchema. But not found in DataConfig.

Regards, Adam

On Mon, Dec 20, 2010 at 10:58 AM, Adam Estrada estrada.a...@gmail.com wrote: This is the response I get... Does it matter that the configuration file is called something other than data-config.xml? After I get this, I still have to restart the service. I wonder... do I need to commit the change?

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">520</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">./solr/conf/dataimporthandler/rss.xml</str>
    </lst>
  </lst>
  <str name="command">reload-config</str>
  <str name="status">idle</str>
  <str name="importResponse">Configuration Re-loaded sucessfully</str>
  <lst name="statusMessages"/>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

On Sun, Dec 19, 2010 at 11:12 PM, Ahmet Arslan iori...@yahoo.com wrote:

Full Import: http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=full-import
Reload Configuration: http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=reload-config

All, The links above are meant for me to reload the configuration file after a change is made, and the other is to perform the full import. My problem is that the reload-config option does not seem to be working. Am I doing anything wrong? Your expertise is greatly appreciated!

I am sorry, I hit the reply button accidentally. Are you receiving/checking the message <str name="importResponse">Configuration Re-loaded sucessfully</str> after the reload? And are you checking that data-config.xml is valid XML after editing it programmatically? And instead of editing the data-config.xml file, can't you use a variable resolver? http://search-lucene.com/m/qYzPk2n86iIsubj
[Import Timeout] using /dataimport
All, I've noticed that there are some RSS feeds that are slow to respond, especially during high usage times throughout the day. Is there a way to set the timeout to something really high or have it just wait until the feed is returned? The entire thing stops working when the feed doesn't respond. Your ideas are greatly appreciated. Adam
Re: [Reload-Config] not working
This is the response I get... Does it matter that the configuration file is called something other than data-config.xml? After I get this, I still have to restart the service. I wonder... do I need to commit the change?

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">520</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">./solr/conf/dataimporthandler/rss.xml</str>
    </lst>
  </lst>
  <str name="command">reload-config</str>
  <str name="status">idle</str>
  <str name="importResponse">Configuration Re-loaded sucessfully</str>
  <lst name="statusMessages"/>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

On Sun, Dec 19, 2010 at 11:12 PM, Ahmet Arslan iori...@yahoo.com wrote:

Full Import: http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=full-import
Reload Configuration: http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=reload-config

All, The links above are meant for me to reload the configuration file after a change is made, and the other is to perform the full import. My problem is that the reload-config option does not seem to be working.

I am sorry, I hit the reply button accidentally. Are you receiving/checking the message <str name="importResponse">Configuration Re-loaded sucessfully</str> after the reload? And are you checking that data-config.xml is valid XML after editing it programmatically? And instead of editing the data-config.xml file, can't you use a variable resolver? http://search-lucene.com/m/qYzPk2n86iIsubj
[Nutch] and Solr integration
All, I have a couple websites that I need to crawl, and the following command line used to work, I think. Solr is up and running and everything is fine there, and I can go through and index the site, but I really need the results added to Solr after the crawl. Does anyone have any idea how to make that happen or what I'm doing wrong? These errors are being thrown from Hadoop, which I am not using at all.

$ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr
crawl started in: crawl
rootUrlDir = http://localhost:8983/solr
threads = 10
depth = 100
indexer=lucene
topN = 50
Injector: starting at 2010-12-20 15:23:25
Injector: crawlDb: crawl/crawldb
Injector: urlDir: http://localhost:8983/solr
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: No FileSystem for scheme: http
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:169)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
Re: [Nutch] and Solr integration
bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr

I've run that command before and it worked... that's why I asked. Grab Nutch from trunk and run bin/nutch and see that it is in fact an option. It looks like Hadoop is the culprit now, and I am at a loss on how to fix it. Thanks for the feedback. Adam

On Mon, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote: Why are you using solrindex in the argument? It is used when we need to index the crawled data in Solr. For more, read http://wiki.apache.org/nutch/NutchTutorial . Also, for Nutch-Solr integration this is a very useful blog: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ I integrated Nutch and Solr and it works well. Thanks

On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene] ml-node+2122347-622655030-146...@n3.nabble.com wrote: All, I have a couple websites that I need to crawl, and the following command line used to work, I think. Solr is up and running and everything is fine there, and I can go through and index the site, but I really need the results added to Solr after the crawl. Does anyone have any idea how to make that happen or what I'm doing wrong? These errors are being thrown from Hadoop, which I am not using at all.

$ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr
crawl started in: crawl
rootUrlDir = http://localhost:8983/solr
threads = 10
depth = 100
indexer=lucene
topN = 50
Injector: starting at 2010-12-20 15:23:25
Injector: crawlDb: crawl/crawldb
Injector: urlDir: http://localhost:8983/solr
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: No FileSystem for scheme: http
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:169)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

-- View message @ http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122347.html

-- Kumar Anurag
[Reload-Config] not working
Full Import: http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=full-import
Reload Configuration: http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=reload-config

All, The links above are meant for me to reload the configuration file after a change is made, and the other is to perform the full import. My problem is that the reload-config option does not seem to be working. Am I doing anything wrong? Your expertise is greatly appreciated! Adam
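The two links above, written as plain commands (the server address and handler path are as in the original links; the sequence itself is a sketch):

```shell
# Reload the DIH configuration after editing it on disk
curl "http://localhost:8983/solr/select?qt=%2Fdataimport&command=reload-config"

# Then run the import against the reloaded config
curl "http://localhost:8983/solr/select?qt=%2Fdataimport&command=full-import&clean=false&commit=true"
```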
Re: indexing a lot of XML dokuments
I have been very successful following this example: http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example Adam

On Thu, Dec 16, 2010 at 5:44 AM, Jörg Agatz joerg.ag...@googlemail.com wrote: Hi users, I am searching for a way to index a lot of XML documents as fast as possible. I have more than 1 million docs on Server 1 and a Solr multicore on Server 2 with Tomcat. I don't know how I can do it easily and fast. I can't find an idea in the wiki; maybe you have some ideas? King
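Beyond the DIH, one common approach for a million pre-built XML documents is to batch many docs into each POST to Solr's /update handler and commit once at the end, rather than committing per document. A minimal sketch — the chunking helper and batch size are my own illustration, not from the thread, and the actual HTTP POST is left as a comment since it needs a running Solr:

```python
# Batch pre-rendered <doc> strings so each POST to /update carries many docs.
# Batch size and endpoint are illustrative assumptions.

def batches(items, size):
    """Yield successive chunks of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_add_xml(docs):
    """Wrap pre-rendered <doc>...</doc> strings in one <add> envelope."""
    return "<add>" + "".join(docs) + "</add>"

docs = ["<doc><field name='id'>%d</field></doc>" % i for i in range(2500)]
payloads = [build_add_xml(chunk) for chunk in batches(docs, 1000)]

# Each payload would be POSTed to http://localhost:8983/solr/update with
# Content-Type: text/xml, followed by a single <commit/> at the very end.
print(len(payloads))  # 3 batches for 2500 docs at batch size 1000
```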
Re: bulk commits
What is it that you are trying to commit? a

On Thu, Dec 16, 2010 at 1:03 PM, Dennis Gearon gear...@sbcglobal.net wrote: What have people found as the best way to do bulk commits, either from the web or from a file on the system? Dennis Gearon

Signature Warning: It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: bulk commits
,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xam.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xan.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xao.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xap.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update -H Content-Type: text/xml --data-binary 'optimize/' Adam On Thu, Dec 16, 2010 at 1:44 PM, Dennis Gearon gear...@sbcglobal.netwrote: Might be Csv or tab delimited text. Sent from Yahoo! Mail on Android -- * From: * Adam Estrada estrada.adam.gro...@gmail.com; * To: * solr-user@lucene.apache.org; * Subject: * Re: bulk commits * Sent: * Thu, Dec 16, 2010 6:35:17 PM what is it that you are trying to commit? a On Thu, Dec 16, 2010 at 1:03 PM, Dennis Gearon gear...@sbcglobal.net wrote: What have people found as the best way to do bulk commits either from the web or from a file on the system? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. 
Re: bulk commits
One very important thing I forgot to mention is that you will have to increase the JAVA heap size for larger data sets. Set JAVA_OPT to something acceptable. Adam On Thu, Dec 16, 2010 at 3:27 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Dec 16, 2010 at 3:06 PM, Dennis Gearon gear...@sbcglobal.net wrote: That easy, huh? Heck, this gets better and better. BTW, how about escaping? The CSV escaping? It's configurable to allow for loading different CSV dialects. http://wiki.apache.org/solr/UpdateCSV By default it uses double quote encapsulation, like Excel would. The bottom of the wiki page shows how to configure tab separators and backslash escaping like MySQL produces by default. -Yonik http://www.lucidimagination.com - Original Message From: Adam Estrada estrada.adam.gro...@gmail.com To: Dennis Gearon gear...@sbcglobal.net; solr-user@lucene.apache.org Sent: Thu, December 16, 2010 10:58:47 AM Subject: Re: bulk commits This is how I import a lot of data from a csv file. There are close to 100k records in there. Note that you can either pre-define the column names using the fieldnames param like I did here *or* include header=true which will automatically pick up the column header if your file has it. curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" This seems to load everything into some kind of temporary location before it's actually committed.
If something goes wrong there is a rollback feature that will undo anything that happened before the commit. As far as batching a bunch of files, I copied and pasted the following into Cygwin and it worked just fine. curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xab.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xac.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xad.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xae.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" curl
"http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xaf.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xag.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xah.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8" curl http://localhost:8983/solr
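Since the chunk files above follow `split`'s default naming (xab, xac, ..., xah, ...), the whole batch can be driven by a loop instead of pasting one command per chunk. A minimal sketch under the same assumptions about paths and field names; it only echoes the commands, so drop the `echo` (keeping the quotes around the URL, since it contains `&`) to actually run them:

```shell
# Field list shared by every chunk; the /tmp paths are illustrative.
FIELDS="id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate"
for f in /tmp/xa?.csv; do
  # commit=false per chunk; a single commit at the end is usually faster
  echo curl "http://localhost:8983/solr/update/csv?commit=false&separator=%2C&fieldnames=${FIELDS}&stream.file=${f}&overwrite=true&stream.contentType=text/plain;charset=utf-8"
done
echo curl "http://localhost:8983/solr/update?commit=true"
# For a MySQL-style tab-separated dump (see the UpdateCSV wiki page),
# swap in separator=%09 and escape=%5C instead of separator=%2C.
```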
Re: [DIH] Example for SQL Server
Thanks All, Testing here shortly and will report back asap. w/r, Adam On Wed, Dec 15, 2010 at 4:10 AM, Savvas-Andreas Moysidis savvas.andreas.moysi...@googlemail.com wrote: Hi Adam, we are using DIH to index off an SQL Server database (the freeby SQLExpress one.. ;) ). We have defined the following in our %TOMCAT_HOME%\solr\conf\data-config.xml: <dataConfig> <dataSource type="JdbcDataSource" name="mssqlDatasource" driver="net.sourceforge.jtds.jdbc.Driver" url="jdbc:jtds:sqlserver://{server.name}:{server.port}/{dbInstanceName};instance=SQLEXPRESS" convertType="true" user="{user.name}" password="{user.password}"/> <document> <entity name="id" dataSource="mssqlDatasource" query="your query here" /> </document> </dataConfig> We downloaded a JDBC driver from here http://jtds.sourceforge.net/faq.html and found it to be a quite stable driver. And the only thing we really had to do was drop that library in the %TOMCAT_HOME%\lib directory (for Tomcat 6+). Hope that helps. -- Savvas. On 14 December 2010 22:46, Erick Erickson erickerick...@gmail.com wrote: The config isn't really any different for various SQL instances; about the only difference is the driver. Have you seen the example in the distribution, somewhere like solr_home/example/example-DIH/solr/db/conf/db-data-config.xml? Also, there's a magic URL for debugging DIH at: .../solr/admin/dataimport.jsp If none of that is useful, could you post your attempt and maybe someone can offer some hints? Best Erick On Tue, Dec 14, 2010 at 5:32 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Does anyone have an example config.xml file I can take a look at for SQL Server? I need to index a lot of data from a DB and can't seem to figure out the right syntax, so any help would be greatly appreciated. What is the correct jar file to use and where do I put it in order for it to work? Thanks, Adam
Re: Dataimport performance
What version of Solr are you using? Adam 2010/12/15 Robert Gründler rob...@dubture.com Hi, we're looking for some comparison-benchmarks for importing large tables from a mysql database (full import). Currently, a full import of ~ 8 million rows from a MySQL database takes around 3 hours, on a QuadCore machine with 16 GB of RAM and a RAID 10 storage setup. Solr is running on an Apache Tomcat instance, where it is the only app. The Tomcat instance has the following memory-related java_opts: -Xms4096M -Xmx5120M The data-config.xml looks like this (only 1 entity): <entity name="track" query="select t.id as id, t.title as title, l.title as label from track t left join label l on (l.id = t.label_id) where t.deleted = 0" transformer="TemplateTransformer"> <field column="title" name="title_t" /> <field column="label" name="label_t" /> <field column="id" name="sf_meta_id" /> <field column="metaclass" template="Track" name="sf_meta_class"/> <field column="metaid" template="${track.id}" name="sf_meta_id"/> <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/> <entity name="artists" query="select a.name as artist from artist a left join track_artist ta on (ta.artist_id = a.id) where ta.track_id=${track.id}"> <field column="artist" name="artists_t" /> </entity> </entity> We have the feeling that 3 hours for this import is quite long, regarding the performance of the server running solr/mysql. Are we wrong with that assumption, or do people experience similar import times with this amount of data to be imported? thanks! -robert
[Adding] Entities when indexing a DB
All, I have successfully indexed a single entity but when I try multiple entities the second is skipped altogether. Is there something wrong with my config file? <?xml version="1.0" encoding="utf-8" ?> <dataConfig> <dataSource type="JdbcDataSource" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://10.0.2.93;databaseName=50_DEV" user="adam" password="password"/> <document name="events"> <entity datasource="MISSIONS" query="SELECT IdMission AS id, CoreGroup AS cat, StrMissionname AS subject, strDescription AS description, DateCreated AS pubdate FROM dbo.tblMission"> <field column="id" name="id" /> <field column="cat" name="cat" /> <field column="subject" name="subject" /> <field column="description" name="description" /> <field column="pubdate" name="date" /> </entity> <entity datasource="EVENTS" query="SELECT strsubject AS subject, strsummary as description, datecreated as date, CoreGroup as cat, idevent as id FROM dbo.tblEvent"> <field column="id" name="id" /> <field column="cat" name="cat" /> <field column="subject" name="subject" /> <field column="description" name="description" /> <field column="pubdate" name="date" /> </entity> </document> </dataConfig>
Re: [Adding] Entities when indexing a DB
Ahhh...I found that I did not set a dataSource name; when I did that and then referred each entity to that dataSource, all went according to plan ;-) <?xml version="1.0" encoding="utf-8" ?> <dataConfig> <dataSource type="JdbcDataSource" name="bleh" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://server;databaseName=50_DEV" user="adam" password="pw"/> <document> <entity name="Missions" dataSource="bleh" query="SELECT (IdMission + 100) AS id, idMission as missionid, CoreGroup AS cat, StrMissionname AS subject, strDescription AS description, DateCreated AS pubdate, 'Mission' AS cat2 FROM dbo.tblMission"> <field column="id" name="id" /> <field column="missionid" name="missionid" /> <field column="cat" name="cat" /> <field column="cat2" name="cat2" /> <field column="subject" name="subject" /> <field column="description" name="description" /> <field column="pubdate" name="date" /> </entity> <entity name="Events" dataSource="bleh" query="SELECT strsubject AS subject, strsummary as description, datecreated as date, CoreGroup as cat, idevent as id, 'Event' AS cat2, IdEvent AS missionid FROM dbo.tblEvent"> <field column="id" name="id" /> <field column="missionid" name="missionid" /> <field column="cat" name="cat" /> <field column="cat2" name="cat2" /> <field column="subject" name="subject" /> <field column="description" name="description" /> <field column="pubdate" name="date" /> </entity> </document> </dataConfig> Solr Rocks! Adam On Wed, Dec 15, 2010 at 3:53 PM, Allistair Crossley a...@roxxor.co.uk wrote: mission.id and event.id, if the same value, will be overwriting the indexed document. Your ids need to be unique across all documents. I usually have a field id_original that I map the table id to, and then for the id per entity I usually prefix it with the entity name in the value mapped to the schema id field. On 15 Dec 2010, at 20:49, Adam Estrada wrote: All, I have successfully indexed a single entity but when I try multiple entities the second is skipped altogether. Is there something wrong with my config file? <?xml version="1.0" encoding="utf-8" ?>
<dataConfig> <dataSource type="JdbcDataSource" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://10.0.2.93;databaseName=50_DEV" user="adam" password="password"/> <document name="events"> <entity datasource="MISSIONS" query="SELECT IdMission AS id, CoreGroup AS cat, StrMissionname AS subject, strDescription AS description, DateCreated AS pubdate FROM dbo.tblMission"> <field column="id" name="id" /> <field column="cat" name="cat" /> <field column="subject" name="subject" /> <field column="description" name="description" /> <field column="pubdate" name="date" /> </entity> <entity datasource="EVENTS" query="SELECT strsubject AS subject, strsummary as description, datecreated as date, CoreGroup as cat, idevent as id FROM dbo.tblEvent"> <field column="id" name="id" /> <field column="cat" name="cat" /> <field column="subject" name="subject" /> <field column="description" name="description" /> <field column="pubdate" name="date" /> </entity> </document> </dataConfig>
Thank you!
I just want to say that this mailing list has been invaluable to a newbie like me ;-) I posted a question earlier today and literally 10 minutes later I got an answer that helped me solve my problem. This is proof that there is an experienced and energetic community behind this FOSS group of projects, and I really appreciate everyone who has put up with my otherwise trivial questions! More importantly, thanks to all of the contributors who make the whole thing possible! I attended the Lucene Revolution conference in Boston this year and the information that I was able to take away from the whole thing has made me and my vocation a lot more valuable. Keep up the outstanding work in the discovery of useful information from a sea of bleh ;-) Kindest regards, Adam
[DIH] Example for SQL Server
Does anyone have an example config.xml file I can take a look at for SQL Server? I need to index a lot of data from a DB and can't seem to figure out the right syntax, so any help would be greatly appreciated. What is the correct jar file to use and where do I put it in order for it to work? Thanks, Adam
Re: [pubDate] is not converting correctly
+1 If I knew enough about how to do this in Java I would, but I do not. What is the correct way to add or suggest enhancements to Solr core? Adam On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog goks...@gmail.com wrote: Nice find! This is Apache 2.0, copyright SUN. O Great Apache Elders: Is it kosher to add this to the Solr distribution? It's not in the JDK and is also com.sun.* On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Thanks for the feedback! There are quite a few formats that can be used. I am experiencing at least 5 of them. Would something like this work? Note that there are 2 different formats separated by a comma. <field column="pubdate" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss zzz, yyyy-MM-dd'T'HH:mm:ss'Z'" /> I don't suppose it will because there is already a comma in the first parser. I guess I am really looking for an all-purpose date/time parser, but even if I have that, would I still be able to query *all* fields in the index? Good article: http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm Adam On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: (10/12/13 8:49), Adam Estrada wrote: All, I am having some difficulties parsing the pubDate field that is part of the RSS spec (I believe). I get the warning that states, Dec 12, 2010 6:45:26 PM org.apache.solr.handler.dataimport.DateFormatTransformer transformRow WARNING: Could not parse a Date field java.text.ParseException: Unparseable date: Thu, 30 Jul 2009 14:41:43 + at java.text.DateFormat.parse(Unknown Source) Does anyone know how to fix this? I would eventually like to do a date query but without the ability to properly parse them I don't know if it's going to work. Thanks, Adam Adam, How does your data-config.xml look for that field? Have you looked at the rss-data-config.xml file under the example/example-DIH/solr/rss/conf directory?
Koji -- http://www.rondhuit.com/en/ -- Lance Norskog goks...@gmail.com
Re: Indexing pdf files - question.
Hi, I use the following command to post PDF files. $ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\document.docx&stream.contentType=application/msword&literal.id=esc.doc&commit=true" $ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\features.pdf&stream.contentType=application/pdf&literal.id=esc2.doc&commit=true" $ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\Memo_ocrd.pdf&stream.contentType=application/pdf&literal.id=Memo_ocrd.pdf&defaultField=text&commit=true" The PDFs have to be OCR'd. Adam On Mon, Dec 13, 2010 at 11:01 AM, Siebor, Wlodek [USA] siebor_wlo...@bah.com wrote: Hi, Can somebody please send me a command for indexing a sample pdf with ExtractingRequestHandler, using a file available in the /docs directory. I have lucidworks solr installed on linux, with standard schema.xml and solrconfig.xml files (unchanged). I want to pass the name of the file as the unique id. I’m trying various curl commands and so far I have either “… missing required field: id” or “.. missing content stream” errors. Thanks for your help, Wlodek
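The per-file commands above generalize to a loop; a hedged sketch (the /tmp/docs directory is a placeholder, not from the thread) that posts every PDF in a directory and derives `literal.id` from the file name, echoing the commands as a dry run:

```shell
for f in /tmp/docs/*.pdf; do
  id=$(basename "$f")   # use the file name itself as the unique id
  echo curl "http://localhost:8983/solr/update/extract?stream.file=${f}&stream.contentType=application/pdf&literal.id=${id}&commit=true"
done
```

Drop the `echo` to actually send the requests; the quotes around the URL matter because it contains `&`.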
Re: [pubDate] is not converting correctly
My first submission ;-) https://issues.apache.org/jira/browse/SOLR-2286 Adam On Mon, Dec 13, 2010 at 5:14 PM, Lance Norskog goks...@gmail.com wrote: Create an account at https://issues.apache.org/jira/secure/Dashboard.jspa and do 'Create New Issue' for the Solr project. On Mon, Dec 13, 2010 at 2:13 PM, Lance Norskog goks...@gmail.com wrote: Please file a JIRA requesting this. On Mon, Dec 13, 2010 at 6:29 AM, Adam Estrada estrada.a...@gmail.com wrote: +1 If I knew enough about how to do this in Java I would, but I do not. What is the correct way to add or suggest enhancements to Solr core? Adam On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog goks...@gmail.com wrote: Nice find! This is Apache 2.0, copyright SUN. O Great Apache Elders: Is it kosher to add this to the Solr distribution? It's not in the JDK and is also com.sun.* On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Thanks for the feedback! There are quite a few formats that can be used. I am experiencing at least 5 of them. Would something like this work? Note that there are 2 different formats separated by a comma. <field column="pubdate" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss zzz, yyyy-MM-dd'T'HH:mm:ss'Z'" /> I don't suppose it will because there is already a comma in the first parser. I guess I am really looking for an all-purpose date/time parser, but even if I have that, would I still be able to query *all* fields in the index? Good article: http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm Adam On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: (10/12/13 8:49), Adam Estrada wrote: All, I am having some difficulties parsing the pubDate field that is part of the RSS spec (I believe).
I get the warning that states, Dec 12, 2010 6:45:26 PM org.apache.solr.handler.dataimport.DateFormatTransformer transformRow WARNING: Could not parse a Date field java.text.ParseException: Unparseable date: Thu, 30 Jul 2009 14:41:43 + at java.text.DateFormat.parse(Unknown Source) Does anyone know how to fix this? I would eventually like to do a date query but without the ability to properly parse them I don't know if it's going to work. Thanks, Adam Adam, How does your data-config.xml look like for that field? Have you looked at rss-data-config.xml file under example/example-DIH/solr/rss/conf directory? Koji -- http://www.rondhuit.com/en/ -- Lance Norskog goks...@gmail.com -- Lance Norskog goks...@gmail.com -- Lance Norskog goks...@gmail.com
SpatialTierQueryParserPlugin Loading Error
All, Can anyone shed some light on this error. I can't seem to get this class to load. I am using the distribution of Solr from Lucid Imagination and the Spatial Plugin from here https://issues.apache.org/jira/browse/SOLR-773. I don't know how to apply a patch but the jar file is in there. What else can I do? org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373) at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413) at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435) at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498) at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492) at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525) at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1442) at org.apache.solr.core.SolrCore.init(SolrCore.java:548) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83) at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594) at org.mortbay.jetty.servlet.Context.startContext(Context.java:139) at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218) at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500) at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147) at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117) at org.mortbay.jetty.Server.doStart(Server.java:210) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.mortbay.start.Main.invokeMain(Main.java:183) at org.mortbay.start.Main.start(Main.java:497) at org.mortbay.start.Main.main(Main.java:115) Caused by: java.lang.ClassNotFoundException: org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.net.FactoryURLClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Unknown Source) at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357) ... 33 more
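For what it's worth, a ClassNotFoundException like the one above usually means the plugin jar never made it onto Solr's classpath. One possible fix (the jar name and Solr home path here are illustrative, not taken from the thread) is to drop the jar into the core's lib directory and restart Solr; shown as a dry run:

```shell
SOLR_HOME="/opt/solr/example/solr"       # illustrative path to the Solr core
PLUGIN_JAR="spatial-solr-plugin.jar"     # hypothetical name for the SOLR-773 jar
echo mkdir -p "${SOLR_HOME}/lib"         # create the per-core lib dir if absent
echo cp "${PLUGIN_JAR}" "${SOLR_HOME}/lib/"
```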
Re: SOLR geospatial
I am particularly interested in storing and querying polygons. That sort of thing looks like it's on their roadmap, so does anyone know what the status is on that? Also, integration with JTS would make this a core component of any GIS. Again, anyone know what the status is on that? *What’s on the roadmap of future features?* Here are some of the features and enhancements we're planning for SSP:
- Performance improvements for larger data sets
- Fixing of known bugs
- Distance facets: allowing Solr users to filter their results based on the calculated distances
- Search with regular polygons, and groups of shapes
- Integration with JTS
- Highly optimized distance calculation algorithms
- Ranking results by distance
- 3D dimension search
Adam On Sun, Dec 12, 2010 at 12:01 AM, Markus Jelsma markus.jel...@openindex.io wrote: That smells like: http://www.jteam.nl/news/spatialsolr.html My partner is using a publicly available plugin for GeoSpatial. It is used both during indexing and during search. It forms some kind of gridding system and puts 10 fields per row related to that. Doing a Radius search (vs a bounding box search, which is faster in almost all cases in all GeoSpatial query systems) seems pretty fast. GeoSpatial was our project's constraint. We've moved past that now. Did I mention that it returns distance from the center of the radius based on units supplied in the query? I would tell you what the plugin is, but in our division of labor, I have kept that out of my short term memory. You can contact him at: Danilo Unite danilo.un...@gmail.com; Dennis Gearon
- Original Message From: George Anthony pa...@rogers.com To: solr-user@lucene.apache.org Sent: Fri, December 10, 2010 9:23:18 AM Subject: SOLR geospatial In looking at some of the docs support for geospatial search. I see this functionality is mostly scheduled for upcoming release 4.0 (with some playing around with backported code). I note the support for the bounding box filter, but will bounding box be one of the supported *data* types for use with this filter? For example, if my lat/long data describes the footprint of a map, I'm curious if that type of coordinate data can be used by the bounding box filter (or in any other way for similar limiting/filtering capability). I see it can work with point type data but curious about functionality with bounding box type data (in contrast to simple point lat/long data). Thanks, George
Re: SOLR geospatial
I would be more than happy to help with any of the spatial testing you are working on. adam On Sun, Dec 12, 2010 at 3:08 PM, Dennis Gearon gear...@sbcglobal.netwrote: We're in Alpha, heading to Alpha 2. Our requirements are simple: radius searching, and distance from center. Solr Spatial works and is current. GeoSpatial is almost there, but we're going to wait until it's released to spend time with it. We have other tasks to work on and don't want to be part of the debugging process of any project right now. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Sent: Sun, December 12, 2010 11:18:03 AM Subject: Re: SOLR geospatial By and large, spatial solr is being replaced by geospatial, see: http://wiki.apache.org/solr/SpatialSearch. I don't think the old spatial contrib is still included in the trunk or 3.x code bases, but I could be wrong That said, I don't know whether what you want is on the roadmap there either. Here's a place to start if you want to see the JIRA discussions: https://issues.apache.org/jira/browse/SOLR-1568 Best Erick On Sun, Dec 12, 2010 at 11:23 AM, Adam Estrada estrada.a...@gmail.com wrote: I am particularly interested in storing and querying polygons. That sort of thing looks like its on their roadmap so does anyone know what the status is on that? Also, integration with JTS would make this a core component of any GIS. Again, anyone know what the status is on that? 
*What’s on the roadmap of future features?* Here are some of the features and henhancements we're planning for SSP: - Performance improvements for larger data sets - Fixing of known bugs - Distance facets: Allowing Solr users to be able to filter their results based on the calculated distances. - Search with regular polygons, and groups of shapes - Integration with JTS - Highly optimized distance calculation algorithms - Ranking results by distance - 3D dimension search Adam On Sun, Dec 12, 2010 at 12:01 AM, Markus Jelsma markus.jel...@openindex.iowrote: That smells like: http://www.jteam.nl/news/spatialsolr.html My partner is using a publicly available plugin for GeoSpatial. It is used both during indexing and during search. It forms some kind of gridding system and puts 10 fields per row related to that. Doing a Radius search (vs a bounding box search which is faster in almost all cases in all GeoSpatial query systems) seems pretty fast. GeoSpatial was our project's constraint. We've moved past that now. Did I mention that it returns distance from the center of the radius based on units supplied in the query? I would tell you what the plugin is, but in our division of labor, I have kept that out of my short term memory. You can contact him at: Danilo Unite danilo.un...@gmail.com; Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: George Anthony pa...@rogers.com To: solr-user@lucene.apache.org Sent: Fri, December 10, 2010 9:23:18 AM Subject: SOLR geospatial In looking at some of the docs support for geospatial search. I see this functionality is mostly scheduled for upcoming release 4.0 (with some playing around with backported code). 
I note the support for the bounding box filter, but will bounding box be one of the supported *data* types for use with this filter? For example, if my lat/long data describes the footprint of a map, I'm curious if that type of coordinate data can be used by the bounding box filter (or in any other way for similar limiting/filtering capability). I see it can work with point type data but curious about functionality with bounding box type data (in contrast to simple point lat/long data). Thanks, George
Re: [Multiple] RSS Feeds at a time...
Hi Ahmet, This is a great idea but still does not appear to be working correctly. The idea is that I want to be able to add an RSS feed and then index that feed on a schedule. My C# method looks something like this. public ActionResult Index() { try { HTTPGet req = new HTTPGet(); string solrStr = System.Configuration.ConfigurationManager.AppSettings["solrUrl"].ToString(); req.Request(solrStr + "/select?clean=true&commit=true&qt=/dataimport&command=reload-config"); req.Request(solrStr + "/select?clean=false&commit=true&qt=/dataimport&command=full-import"); Response.Write(req.StatusLine); Response.Write(req.ResponseTime); Response.Write(req.StatusCode); return RedirectToAction("../Import/Feeds"); //return View(); } catch (SolrConnectionException) { throw new Exception(string.Format("Couldn't Import RSS Feeds")); } } My XML configuration file looks something like this... <dataConfig> <dataSource type="HttpDataSource" /> <document> <entity name="filedatasource" processor="FileListEntityProcessor" baseDir="./solr/conf/dataimporthandler" fileName="^.*xml$" recursive="true" rootEntity="false" dataSource="null"> <entity name="cnn" pk="link" datasource="filedatasource" url="http://rss.cnn.com/rss/cnn_topstories.rss" processor="XPathEntityProcessor" forEach="/rss/channel | /rss/channel/item" transformer="DateFormatTransformer,HTMLStripTransformer"> <field column="source" xpath="/rss/channel/title" commonField="true" /> <field column="source-link" xpath="/rss/channel/link" commonField="true" /> <field column="subject" xpath="/rss/channel/description" commonField="true" /> <field column="title" xpath="/rss/channel/item/title" /> <field column="link" xpath="/rss/channel/item/link" /> <field column="description" xpath="/rss/channel/item/description" stripHTML="true" /> <field column="creator" xpath="/rss/channel/item/creator" /> <field column="item-subject" xpath="/rss/channel/item/subject" /> <field column="author" xpath="/rss/channel/item/author" /> <field column="comments" xpath="/rss/channel/item/comments" /> <field column="pubdate" xpath="/rss/channel/item/pubDate"
dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" /> </entity> <entity name="newsweek" pk="link" datasource="filedatasource" url="http://feeds.newsweek.com/newsweek/nation" processor="XPathEntityProcessor" forEach="/rss/channel | /rss/channel/item" transformer="DateFormatTransformer,HTMLStripTransformer"> <field column="source" xpath="/rss/channel/title" commonField="true" /> <field column="source-link" xpath="/rss/channel/link" commonField="true" /> <field column="subject" xpath="/rss/channel/description" commonField="true" /> <field column="title" xpath="/rss/channel/item/title" /> <field column="link" xpath="/rss/channel/item/link" /> <field column="description" xpath="/rss/channel/item/description" stripHTML="true" /> <field column="creator" xpath="/rss/channel/item/creator" /> <field column="item-subject" xpath="/rss/channel/item/subject" /> <field column="author" xpath="/rss/channel/item/author" /> <field column="comments" xpath="/rss/channel/item/comments" /> <field column="pubdate" xpath="/rss/channel/item/pubDate" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/> </entity> </entity> </document> </dataConfig> As you can see, I can add what appears to be as many sub-entities as I want. The idea was to reload the xml file after each entity is added. What else am I missing here, because the reload-config command does not seem to be working? Any ideas would be great! Thanks, Adam Estrada On Sat, Dec 11, 2010 at 4:48 PM, Ahmet Arslan iori...@yahoo.com wrote: I found that you can have a single config file that can have several entities in it. My question now is how can I add entities without restarting the Solr service? You mean changing and re-loading the xml config file? dataimport?command=reload-config http://wiki.apache.org/solr/DataImportHandler#Commands
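The two requests issued from the C# method can also be tested directly with curl; a sketch assuming the DataImportHandler is registered at /dataimport in solrconfig.xml (echoed as a dry run, drop the `echo` to run):

```shell
DIH="http://localhost:8983/solr/dataimport"
# Re-read data-config.xml after editing it, then kick off the import
echo curl "${DIH}?command=reload-config"
echo curl "${DIH}?command=full-import&clean=false&commit=true"
```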
[pubDate] is not converting correctly
All, I am having some difficulties parsing the pubDate field that is part of the RSS spec (I believe). I get the warning that states, Dec 12, 2010 6:45:26 PM org.apache.solr.handler.dataimport.DateFormatTransformer transformRow WARNING: Could not parse a Date field java.text.ParseException: Unparseable date: Thu, 30 Jul 2009 14:41:43 + at java.text.DateFormat.parse(Unknown Source) Does anyone know how to fix this? I would eventually like to do a date query but without the ability to properly parse them I don't know if it's going to work. Thanks, Adam
Re: [pubDate] is not converting correctly
Thanks for the feedback! There are quite a few formats that can be used. I am experiencing at least 5 of them. Would something like this work? Note that there are 2 different formats separated by a comma.

<field column="pubdate" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss zzz, yyyy-MM-dd'T'HH:mm:ss'Z'" />

I don't suppose it will, because there is already a comma in the first pattern. I guess I am really looking for an all-purpose date-time parser, but even if I have that, would I still be able to query *all* fields in the index? Good article: http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm Adam On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: (10/12/13 8:49), Adam Estrada wrote: All, I am having some difficulties parsing the pubDate field that is part of the RSS spec (I believe). I get the warning that states, Dec 12, 2010 6:45:26 PM org.apache.solr.handler.dataimport.DateFormatTransformer transformRow WARNING: Could not parse a Date field java.text.ParseException: Unparseable date: Thu, 30 Jul 2009 14:41:43 + at java.text.DateFormat.parse(Unknown Source) Does anyone know how to fix this? I would eventually like to do a date query but without the ability to properly parse them I don't know if it's going to work. Thanks, Adam Adam, What does your data-config.xml look like for that field? Have you looked at the rss-data-config.xml file under the example/example-DIH/solr/rss/conf directory? Koji -- http://www.rondhuit.com/en/
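Since no single pattern covers both the RFC-822-style dates and Solr's ISO form, one workaround outside the DIH (for example in a custom transformer, or in the app that fetches the feeds) is to try a list of patterns in order. A sketch in Java; the two-entry pattern list is an assumption covering just the formats mentioned in this thread:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class MultiFormatDateParser {
    // Candidate patterns seen in RSS feeds (assumption: extend as needed).
    private static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss Z",  // RFC 822 style: Thu, 30 Jul 2009 14:41:43 +0000
        "yyyy-MM-dd'T'HH:mm:ss'Z'"      // Solr/ISO style: 2005-08-01T16:30:25Z
    };

    public static Date parse(String raw) {
        for (String p : PATTERNS) {
            try {
                return new SimpleDateFormat(p, Locale.ENGLISH).parse(raw.trim());
            } catch (ParseException ignored) {
                // fall through and try the next pattern
            }
        }
        return null; // caller decides how to handle unparseable dates
    }

    public static void main(String[] args) {
        System.out.println(parse("Thu, 30 Jul 2009 14:41:43 +0000"));
        System.out.println(parse("2005-08-01T16:30:25Z"));
    }
}
```

This sidesteps the comma problem entirely: each pattern lives in its own array entry rather than in one comma-separated `dateTimeFormat` string.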
Re: Indexing documents with SOLR
Pankaj, Check this article out on how to get going with Nutch: http://bit.ly/dbBdK4 This is a few months old so you will have to note that there is a new parameter called something like -SolrUrl that will allow you to update your Solr index with the crawled data. For crawling your local file system, you will have to change the http:// to file:// in your seed.txt file to point to the directory you want to crawl. Another VERY important option is to increase your Java heap size. I do this by using the JAVA_OPT environment variable. Adam On Sat, Dec 11, 2010 at 8:27 AM, pankaj bhatt panbh...@gmail.com wrote: Hi Adam, Thanks a lot for pointing me to NUTCH. Can you please tell me, through NUTCH can I read a directory on the local system or on a shared file system? Will wait for your response. / Pankaj Bhatt On Fri, Dec 10, 2010 at 9:35 PM, Adam Estrada estrada.a...@gmail.com wrote: Nutch is also a great option if you want a crawler. I have found that you will need to use the latest version of PDFBox and its dependencies for better results. Also, make sure to set JAVA_OPT to something really large so that you won't exceed your heap size. Adam On Fri, Dec 10, 2010 at 6:27 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: Hi Pankaj, you can find the needed documentation right here [1]. Hope this helps, Tommaso [1] : http://wiki.apache.org/solr/ExtractingRequestHandler 2010/12/10 pankaj bhatt panbh...@gmail.com Hi All, I am a newbie to SOLR and trying to integrate TIKA + SOLR. Can anyone please guide me on how to achieve this? *My Req is:* I have a directory containing a lot of PDFs and DOCs and I need to make a search within the documents. I am using the SOLR web application. I just need some sample xml code both for solr-config.xml and the directory-schema.xml. Awaiting eagerly for your response. Regards, Pankaj Bhatt.
Re: [Multiple] RSS Feeds at a time...
at 10:38 PM, Lance Norskog goks...@gmail.com wrote: There is I believe no way to do this without separate copies of your script. Each 'handler=/dataimport' has to refer to a separate config file. You can make several copies and name them config1.xml, config2.xml etc. You'll have to call each one manually, so you have to manage your own thread pool. On Fri, Dec 10, 2010 at 8:15 AM, Adam Estrada estrada.adam.gro...@gmail.com wrote: All, Right now I am using the default DIH config that comes with the Solr examples. I update my index using the dataimport handler here http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport This works fine but I want to be able to index more than just one feed at a time and more importantly I want to be able to index both ATOM and RSS feeds which means that the schema will definitely be different. There is a good example on how to index all of the example docs in the SolrNet example application but that is looking for xml files with the properly formatted xml tags. foreach (var file in Directory.GetFiles(Server.MapPath(/exampledocs), *.xml)) { connection.Post(/update, File.ReadAllText(file, Encoding.UTF8)); } solr.Commit(); example xml: - add - doc field name=*id*F8V7067-APL-KIT/field field name=*name*Belkin Mobile Power Cord for iPod w/ Dock/field field name=*manu*Belkin/field field name=*cat*electronics/field field name=*cat*connector/field field name=*features*car power adapter, white/field field name=*weight*4/field field name=*price*19.95/field field name=*popularity*1/field field name=*inStock*false/field field name=*manufacturedate_dt*2005-08-01T16:30:25Z/field /doc /add This obviously won't help me when trying to grab random RSS feeds so my question is, how can I ingest several feeds at a time? Can I do this programmatically or is there a configuration option I am missing? Thanks, Adam -- Lance Norskog goks...@gmail.com
Re: [Multiple] RSS Feeds at a time...
You are da man! w00t! adam On Sat, Dec 11, 2010 at 4:48 PM, Ahmet Arslan iori...@yahoo.com wrote: I found that you can have a single config file that can have several entities in it. My question now is how can I add entities without restarting the Solr service? You mean changing and re-loading xml config file? dataimport?command=reload-config http://wiki.apache.org/solr/DataImportHandler#Commands
[Parsing] Date Fields
All, I am ingesting a lot of RSS feeds as part of my application and I keep getting the same error.

WARNING: Could not parse a Date field
java.text.ParseException: Unparseable date: Mon, 06 Dec 2010 23:31:38 +
    at java.text.DateFormat.parse(Unknown Source)
    at org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89)
    at org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Dec 11, 2010 6:25:47 PM org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully
Dec 11, 2010 6:25:47 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)

Are there any tips or tricks to getting standard RSS update fields to import correctly? An example DIH config XML file is as follows:

<entity name="CBS" pk="link" datasource="filedatasource"
        url="http://feeds.cbsnews.com/CBSNewsMain?format=xml"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        transformer="DateFormatTransformer,HTMLStripTransformer">
  <field column="source" xpath="/rss/channel/title" commonField="true" />
  <field column="source-link" xpath="/rss/channel/link" commonField="true" />
  <field column="subject" xpath="/rss/channel/description" commonField="true" />
  <field column="title" xpath="/rss/channel/item/title" />
  <field column="link" xpath="/rss/channel/item/link" />
  <field column="description" xpath="/rss/channel/item/description" stripHTML="true" />
  <field column="creator" xpath="/rss/channel/item/creator" />
  <field column="item-subject" xpath="/rss/channel/item/subject" />
  <field column="author" xpath="/rss/channel/item/author" />
  <field column="comments" xpath="/rss/channel/item/comments" />
  <field column="pubdate" xpath="/rss/channel/item/pubDate" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
</entity>

Any tips on this would be really appreciated, as I need to query based on the date the article was published. Thanks, Adam
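Worth noting: the `dateTimeFormat` in the config expects Solr's ISO form, while the warning shows the feed emitting dates in the RFC-822 style common to RSS. A transformer pattern matching that style would look roughly like this (a sketch only; verify against the actual pubDate strings the feed produces):

```xml
<!-- assumption: the feed's pubDate follows the common RFC-822 RSS form,
     e.g. "Mon, 06 Dec 2010 23:31:38 +0000" -->
<field column="pubdate" xpath="/rss/channel/item/pubDate"
       dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
```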
Re: Indexing documents with SOLR
Nutch is also a great option if you want a crawler. I have found that you will need to use the latest version of PDFBox and its dependencies for better results. Also, make sure to set JAVA_OPT to something really large so that you won't exceed your heap size. Adam On Fri, Dec 10, 2010 at 6:27 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: Hi Pankaj, you can find the needed documentation right here [1]. Hope this helps, Tommaso [1] : http://wiki.apache.org/solr/ExtractingRequestHandler 2010/12/10 pankaj bhatt panbh...@gmail.com Hi All, I am a newbie to SOLR and trying to integrate TIKA + SOLR. Can anyone please guide me on how to achieve this? *My Req is:* I have a directory containing a lot of PDFs and DOCs and I need to make a search within the documents. I am using the SOLR web application. I just need some sample xml code both for solr-config.xml and the directory-schema.xml. Awaiting eagerly for your response. Regards, Pankaj Bhatt.
[Multiple] RSS Feeds at a time...
All, Right now I am using the default DIH config that comes with the Solr examples. I update my index using the dataimport handler here http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport This works fine but I want to be able to index more than just one feed at a time and more importantly I want to be able to index both ATOM and RSS feeds which means that the schema will definitely be different. There is a good example on how to index all of the example docs in the SolrNet example application but that is looking for xml files with the properly formatted xml tags. foreach (var file in Directory.GetFiles(Server.MapPath(/exampledocs), *.xml)) { connection.Post(/update, File.ReadAllText(file, Encoding.UTF8)); } solr.Commit(); example xml: - add - doc field name=*id*F8V7067-APL-KIT/field field name=*name*Belkin Mobile Power Cord for iPod w/ Dock/field field name=*manu*Belkin/field field name=*cat*electronics/field field name=*cat*connector/field field name=*features*car power adapter, white/field field name=*weight*4/field field name=*price*19.95/field field name=*popularity*1/field field name=*inStock*false/field field name=*manufacturedate_dt*2005-08-01T16:30:25Z/field /doc /add This obviously won't help me when trying to grab random RSS feeds so my question is, how can I ingest several feeds at a time? Can I do this programmatically or is there a configuration option I am missing? Thanks, Adam
[Multiple] RSS Feeds and Source Field
All, I am indexing RSS feeds from several sources so I have a couple questions. 1. There is only 1 source for each RSS feed which is typically the name of the feed, I get an error in my app stating *Value cannot be null. Parameter name: source* I look at the index in Luke and there are data values in there. Any ideas on why my app would be throwing that? 2. I would like to ingest several feeds at a time. What is the proper way to define them in a the XML config file? Can I have two document tags in there or am I limited to just one? Adam
Re: [Multiple] RSS Feeds and Source Field
In Luke I looked at the available fields and term counts per field and there is a source field without an asterisk beside it. The source value is CNN.com which is what I would expect it to be. I still get a null value in my app which is probably a bug somewhere in my application. Any more of your suggestions on the index would be greatly appreciated Adam On Thu, Dec 9, 2010 at 3:46 PM, Jonathan Rochkind rochk...@jhu.edu wrote: You look at what index in Luke? I bet you $10 there is no index called source* in your index. With an asterisk in it. On 12/9/2010 3:23 PM, Adam Estrada wrote: All, I am indexing RSS feeds from several sources so I have a couple questions. 1. There is only 1 source for each RSS feed which is typically the name of the feed, I get an error in my app stating *Value cannot be null. Parameter name: source* I look at the index in Luke and there are data values in there. Any ideas on why my app would be throwing that? 2. I would like to ingest several feeds at a time. What is the proper way to define them in a the XML config file? Can I have twodocument tags in there or am I limited to just one? Adam
Re: Open source Solr UI with multiple select faceting?
SolrNet has a great example application that you can use...There is a great Javascript project called SolrAjax but I don't know what the state of it is. Adam On Thu, Dec 9, 2010 at 4:53 PM, Andy angelf...@yahoo.com wrote: Hi, Any open source Solr UI's that support selecting multiple facet values (OR faceting)? For example allowing a user to select red or blue for the facet field Color. I'd prefer libraries in javascript or Python. I know about ajax-solr but it doesn't seem to support multiple selects. Thanks.
Re: [Multiple] RSS Feeds and Source Field
I ended up copying the source field to another which seems to have fixed the problem...I still have so much to learn about when it comes to using Solr... Thanks for all the great feedback, Adam On Thu, Dec 9, 2010 at 11:03 PM, Erick Erickson erickerick...@gmail.comwrote: Hmmm, you say you get an error i your app. I'm a bit confused. Is this before you try to send it to Solr or as a result of sending it to Solr? If the latter, I'd wager source is required in your schema and you're not sending it in your document. Try instrumenting your app to check that every outgoing document has a value If that's irrelevant, can we see your schema file? You can send as many documents in a packet as you want. Best Erick On Thu, Dec 9, 2010 at 3:23 PM, Adam Estrada estrada.adam.gro...@gmail.comwrote: All, I am indexing RSS feeds from several sources so I have a couple questions. 1. There is only 1 source for each RSS feed which is typically the name of the feed, I get an error in my app stating *Value cannot be null. Parameter name: source* I look at the index in Luke and there are data values in there. Any ideas on why my app would be throwing that? 2. I would like to ingest several feeds at a time. What is the proper way to define them in a the XML config file? Can I have two document tags in there or am I limited to just one? Adam
[Casting] values on update/csv
All, I have a csv file and I want to store one of the fields as a tdouble type. It does not like that at all...Is there a way to cast the string value to a tdouble? Thanks, Adam
Re: [Casting] values on update/csv
Hi, I am using curl to run the following, and as soon as I convert the field type from string to tdouble I get the errors you see below.

0:0:0:0:0:0:0:1 - - [08/12/2010:23:28:27 +] GET /solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\allCountries\xaa.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8 HTTP/1.1 500 4023

I am trying to index coordinates in decimal degrees so many of them have negative values. Could this be the problem?

Dec 8, 2010 6:28:27 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NumberFormatException: For input string: "lat"
    at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
    at java.lang.Double.parseDouble(Unknown Source)
    at org.apache.solr.schema.TrieField.createField(TrieField.java:431)
    at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    at org.apache.solr.handler.CSVLoader.doAdd(CSVRequestHandler.java:386)
    at org.apache.solr.handler.SingleThreadedCSVLoader.addDoc(CSVRequestHandler.java:400)
    at org.apache.solr.handler.CSVLoader.load(CSVRequestHandler.java:363)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Dec 8, 2010 6:28:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/csv params={fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&commit=true&overwrite=true&stream.contentType=text/plain;charset%3Dutf-8&separator=,&stream.file=C:\tmp\allCountries\xaa.csv} status=500 QTime=52
Dec 8, 2010 6:28:27 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NumberFormatException: For input string: "lat"
    at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
    at java.lang.Double.parseDouble(Unknown Source)
    at org.apache.solr.schema.TrieField.createField(TrieField.java:431)
    at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    at org.apache.solr.handler.CSVLoader.doAdd(CSVRequestHandler.java:386)
    at org.apache.solr.handler.SingleThreadedCSVLoader.addDoc(CSVRequestHandler.java:400)
    at org.apache.solr.handler.CSVLoader.load(CSVRequestHandler.java:363)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handle
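The `NumberFormatException: For input string: "lat"` means Solr handed the literal column name `lat` to `Double.parseDouble`, which is what happens when the CSV's header row gets indexed as a data row; the negative coordinate values themselves are not the problem. A pure-Java check, no Solr required (the helper name is made up for illustration):

```java
public class HeaderRowRepro {
    // Mirrors what Solr's TrieField does with each cell of a tdouble column.
    public static boolean parsesAsDouble(String cell) {
        try {
            Double.parseDouble(cell);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsDouble("lat"));    // false: the header token triggers the 500
        System.out.println(parsesAsDouble("-77.05")); // true: a negative coordinate parses fine
    }
}
```

If that is the cause, dropping the header line from xaa.csv (or letting the CSV handler skip it; the update/csv handler has a `header` parameter for this) should avoid the exception, since `fieldnames` is already supplied in the URL.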
Re: Batch Update Fields
OK so the way I understand this is that if there is a synonym on a specific field at index time, that value will be stored rather than the one in the csv that I am indexing? I will give it a whirl and report back... Thanks! Adam On Sat, Dec 4, 2010 at 2:27 PM, Erick Erickson erickerick...@gmail.comwrote: When you define your fieldType at index time. My idea was that you substitue these on the way in to your index. You may need a specific field type just for your country conversion Perhaps in a copyField if you need both the code and full name Best Erick On Sat, Dec 4, 2010 at 12:16 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Synonyms eh? I have a synonym list like the following so how do I identify the synonyms on a specific field. The only place the field is used is as a facet. original field = country name AF = AFGHANISTAN AX = ÅLAND ISLANDS AL = ALBANIA DZ = ALGERIA AS = AMERICAN SAMOA AD = ANDORRA AO = ANGOLA AI = ANGUILLA AQ = ANTARCTICA AG = ANTIGUA AND BARBUDA AR = ARGENTINA AM = ARMENIA AW = ARUBA AU = AUSTRALIA AT = AUSTRIA etc... Any advise on that would be great and very much appreciated! Adam On Fri, Dec 3, 2010 at 3:55 PM, Erick Erickson erickerick...@gmail.com wrote: That will certainly work. Another option, assuming the country codes are in their own field would be to put the transformations into a synonym file that was only used on that field. That way you'd get this without having to do the pre-process step of the raw data... That said, if you pre-processing is working for you it may not be worth your while to worry about doing it differently Best Erick On Fri, Dec 3, 2010 at 12:51 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: First off...I know enough about Solr to be VERY dangerous so please bare with me ;-) I am indexing the geonames database which only provides country codes. I can facet the codes but to the end user who may not know all 249 codes, it isn't really all that helpful. 
Therefore, I want to map the full country names to the country codes provided in the geonames db. http://download.geonames.org/export/dump/ http://download.geonames.org/export/dump/I used a simple split function to chop the 850 meg txt file in to manageable csv's that I can import in to Solr. Now that all 7 million + documents are in there, I want to change the country codes to the actual country names. I would of liked to have done it in the index but finding and replacing the strings in the csv seems to be working fine. After that I can just reindex the entire thing. Adam On Fri, Dec 3, 2010 at 12:42 PM, Erick Erickson erickerick...@gmail.com wrote: Have you consider defining synonyms for your code -country conversion at index time (or query time for that matter)? We may have an XY problem here. Could you state the high-level problem you're trying to solve? Maybe there's a better solution... Best Erick On Fri, Dec 3, 2010 at 12:20 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: I wonder...I know that sed would work to find and replace the terms in all of the csv files that I am indexing but would it work to find and replace key terms in the index? find C:\\tmp\\index\\data -type f -exec sed -i 's/AF/AFGHANISTAN/g' {} \; That command would iterate through all the files in the data directory and replace the country code with the full country name. I many just back up the directory and try it. I have it running on csv files right now and it's working wonderfully. For those of you interested, I am indexing the entire Geonames dataset http://download.geonames.org/export/dump/(allCountries.zip) which gives me a pretty comprehensive world gazetteer. My next step is gonna be to display the results as KML to view over a google globe. Thoughts? Adam On Fri, Dec 3, 2010 at 7:57 AM, Erick Erickson erickerick...@gmail.com wrote: No, there's no equivalent to SQL update for all values in a column. You'll have to reindex all the documents. 
On Thu, Dec 2, 2010 at 10:52 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: OK part 2 of my previous question... Is there a way to batch update field values based on a certain criteria? For example, if thousands of documents have a field value of 'US' can I update all of them to 'United States' programmatically? Adam
Re: Batch Update Fields
Synonyms eh? I have a synonym list like the following so how do I identify the synonyms on a specific field. The only place the field is used is as a facet. original field = country name AF = AFGHANISTAN AX = ÅLAND ISLANDS AL = ALBANIA DZ = ALGERIA AS = AMERICAN SAMOA AD = ANDORRA AO = ANGOLA AI = ANGUILLA AQ = ANTARCTICA AG = ANTIGUA AND BARBUDA AR = ARGENTINA AM = ARMENIA AW = ARUBA AU = AUSTRALIA AT = AUSTRIA etc... Any advise on that would be great and very much appreciated! Adam On Fri, Dec 3, 2010 at 3:55 PM, Erick Erickson erickerick...@gmail.comwrote: That will certainly work. Another option, assuming the country codes are in their own field would be to put the transformations into a synonym file that was only used on that field. That way you'd get this without having to do the pre-process step of the raw data... That said, if you pre-processing is working for you it may not be worth your while to worry about doing it differently Best Erick On Fri, Dec 3, 2010 at 12:51 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: First off...I know enough about Solr to be VERY dangerous so please bare with me ;-) I am indexing the geonames database which only provides country codes. I can facet the codes but to the end user who may not know all 249 codes, it isn't really all that helpful. Therefore, I want to map the full country names to the country codes provided in the geonames db. http://download.geonames.org/export/dump/ http://download.geonames.org/export/dump/I used a simple split function to chop the 850 meg txt file in to manageable csv's that I can import in to Solr. Now that all 7 million + documents are in there, I want to change the country codes to the actual country names. I would of liked to have done it in the index but finding and replacing the strings in the csv seems to be working fine. After that I can just reindex the entire thing. 
Adam On Fri, Dec 3, 2010 at 12:42 PM, Erick Erickson erickerick...@gmail.com wrote: Have you consider defining synonyms for your code -country conversion at index time (or query time for that matter)? We may have an XY problem here. Could you state the high-level problem you're trying to solve? Maybe there's a better solution... Best Erick On Fri, Dec 3, 2010 at 12:20 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: I wonder...I know that sed would work to find and replace the terms in all of the csv files that I am indexing but would it work to find and replace key terms in the index? find C:\\tmp\\index\\data -type f -exec sed -i 's/AF/AFGHANISTAN/g' {} \; That command would iterate through all the files in the data directory and replace the country code with the full country name. I many just back up the directory and try it. I have it running on csv files right now and it's working wonderfully. For those of you interested, I am indexing the entire Geonames dataset http://download.geonames.org/export/dump/(allCountries.zip) which gives me a pretty comprehensive world gazetteer. My next step is gonna be to display the results as KML to view over a google globe. Thoughts? Adam On Fri, Dec 3, 2010 at 7:57 AM, Erick Erickson erickerick...@gmail.com wrote: No, there's no equivalent to SQL update for all values in a column. You'll have to reindex all the documents. On Thu, Dec 2, 2010 at 10:52 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: OK part 2 of my previous question... Is there a way to batch update field values based on a certain criteria? For example, if thousands of documents have a field value of 'US' can I update all of them to 'United States' programmatically? Adam
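Erick's index-time synonym idea would look roughly like the following in schema.xml. The fieldType name and synonyms file are made-up placeholders, and multi-word country names need care because the synonym filter emits one token per word:

```xml
<!-- sketch: a field type that rewrites ISO codes to country names at index time -->
<fieldType name="countryName" class="solr.TextField">
  <analyzer>
    <!-- keep the whole value ("AF") as a single token before the synonym lookup -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- country_codes.txt holds lines like: AF => AFGHANISTAN -->
    <filter class="solr.SynonymFilterFactory" synonyms="country_codes.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>
```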
Re: Batch Update Fields
I wonder...I know that sed would work to find and replace the terms in all of the csv files that I am indexing, but would it work to find and replace key terms in the index? find C:\\tmp\\index\\data -type f -exec sed -i 's/AF/AFGHANISTAN/g' {} \; That command would iterate through all the files in the data directory and replace the country code with the full country name. I may just back up the directory and try it. I have it running on csv files right now and it's working wonderfully. For those of you interested, I am indexing the entire Geonames dataset http://download.geonames.org/export/dump/ (allCountries.zip) which gives me a pretty comprehensive world gazetteer. My next step is gonna be to display the results as KML to view over a Google globe. Thoughts? Adam On Fri, Dec 3, 2010 at 7:57 AM, Erick Erickson erickerick...@gmail.com wrote: No, there's no equivalent to SQL update for all values in a column. You'll have to reindex all the documents. On Thu, Dec 2, 2010 at 10:52 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: OK part 2 of my previous question... Is there a way to batch update field values based on a certain criteria? For example, if thousands of documents have a field value of 'US' can I update all of them to 'United States' programmatically? Adam
Re: Batch Update Fields
First off...I know enough about Solr to be VERY dangerous, so please bear with me ;-) I am indexing the geonames database, which only provides country codes. I can facet the codes, but to the end user who may not know all 249 codes, it isn't really all that helpful. Therefore, I want to map the full country names to the country codes provided in the geonames db. http://download.geonames.org/export/dump/ I used a simple split function to chop the 850 meg txt file into manageable csv's that I can import into Solr. Now that all 7 million+ documents are in there, I want to change the country codes to the actual country names. I would have liked to have done it in the index, but finding and replacing the strings in the csv seems to be working fine. After that I can just reindex the entire thing. Adam On Fri, Dec 3, 2010 at 12:42 PM, Erick Erickson erickerick...@gmail.com wrote: Have you considered defining synonyms for your code-to-country conversion at index time (or query time for that matter)? We may have an XY problem here. Could you state the high-level problem you're trying to solve? Maybe there's a better solution... Best Erick On Fri, Dec 3, 2010 at 12:20 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: I wonder...I know that sed would work to find and replace the terms in all of the csv files that I am indexing, but would it work to find and replace key terms in the index? find C:\\tmp\\index\\data -type f -exec sed -i 's/AF/AFGHANISTAN/g' {} \; That command would iterate through all the files in the data directory and replace the country code with the full country name. I may just back up the directory and try it. I have it running on csv files right now and it's working wonderfully. For those of you interested, I am indexing the entire Geonames dataset http://download.geonames.org/export/dump/ (allCountries.zip) which gives me a pretty comprehensive world gazetteer.
My next step is gonna be to display the results as KML to view over a google globe. Thoughts? Adam On Fri, Dec 3, 2010 at 7:57 AM, Erick Erickson erickerick...@gmail.com wrote: No, there's no equivalent to SQL update for all values in a column. You'll have to reindex all the documents. On Thu, Dec 2, 2010 at 10:52 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: OK part 2 of my previous question... Is there a way to batch update field values based on a certain criteria? For example, if thousands of documents have a field value of 'US' can I update all of them to 'United States' programmatically? Adam
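As an aside on the sed idea: running sed over the binary files under index/data would corrupt the Lucene index, so pre-processing the CSVs before indexing is the safer route. That substitution can be sketched as a tiny lookup applied per CSV row; the class name, the three-entry table, and the column layout here are all hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class CountryCodeMapper {
    // Tiny excerpt of the ISO-3166 table from this thread (the full list has ~249 entries).
    private static final Map<String, String> CODES = new HashMap<>();
    static {
        CODES.put("AF", "AFGHANISTAN");
        CODES.put("US", "UNITED STATES");
        CODES.put("AR", "ARGENTINA");
    }

    // Replace the country-code column of one CSV row before indexing;
    // unknown codes pass through unchanged.
    public static String expandRow(String csvRow, int countryCol) {
        String[] cols = csvRow.split(",", -1);
        cols[countryCol] = CODES.getOrDefault(cols[countryCol], cols[countryCol]);
        return String.join(",", cols);
    }

    public static void main(String[] args) {
        System.out.println(expandRow("1,Kabul,AF", 2));
    }
}
```

Unlike a blanket `s/AF/AFGHANISTAN/g`, this only touches the country-code column, so an "AF" appearing inside another field is left alone.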
Joining Fields in an Index
All, I have an index that has a field with country codes in it. There are 7 million or so documents in the index, and when displaying facets the country codes don't mean a whole lot to me. Is there any way to add a field with the full country names and then join the codes to it accordingly? I suppose I can do this before updating the records in the index, but before I do that I would like to know if there is a way to do this sort of join. Example: US -> United States Thanks, Adam
Re: Joining Fields in an Index
Hi, I was hoping to do it directly in the index, but it was more out of curiosity than anything. I can certainly map it in the DAO, but again...I was hoping to learn if it was possible in the index. Thanks for the feedback!

Adam

On Dec 2, 2010, at 5:48 PM, Savvas-Andreas Moysidis wrote:

Hi, if you are able to do a full re-index, then you could index the full names and not the codes. When you later facet on the Country field you'll get the actual name rather than the code. If you are not able to re-index, then this conversion could probably be added at your application layer prior to displaying your results (e.g. in your DAO object).

On 2 December 2010 22:05, Adam Estrada estrada.adam.gro...@gmail.com wrote:

All, I have an index that has a field with country codes in it. I have 7 million or so documents in the index and when displaying facets the country codes don't mean a whole lot to me. Is there any way to add a field with the full country names then join the codes in there accordingly? I suppose I can do this before updating the records in the index but before I do that I would like to know if there is a way to do this sort of join. Example: US -> United States Thanks, Adam
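The application-layer option Savvas-Andreas suggests - leave the codes in the index and translate only at display time - can be sketched as a small lookup over the facet results. This is an illustrative sketch, not code from the thread; the facet shape (a flat code-to-count dict) and the mapping entries are assumptions.

```python
# Sketch of converting facet codes to labels in the application layer
# (e.g. in a DAO), so no reindex is needed.
COUNTRY_NAMES = {"US": "United States", "AF": "Afghanistan"}

def label_facet_counts(facet_counts):
    """Map {code: count} facet results to {full name: count} for display.

    Unknown codes fall through unchanged, so a missing mapping entry
    degrades gracefully instead of hiding a facet value.
    """
    return {COUNTRY_NAMES.get(code, code): n for code, n in facet_counts.items()}

print(label_facet_counts({"US": 812, "AF": 40, "ZZ": 3}))
```

The trade-off versus re-indexing full names: display stays flexible and nothing touches the index, but any query-side filtering still has to use the raw codes.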
Using Multiple Cores for Multiple Users
All, I have a web application that requires the user to register and then login to gain access to the site. Pretty standard stuff...Now I would like to know what the best approach would be to implement a customized search experience for each user. Would this mean creating a separate core per user? I think that this is not possible without restarting Solr after each core is added to the multi-core xml file, right? My use case is this...User A would like to index 5 RSS feeds and User B would like to index 5 completely different RSS feeds and he is not interested at all in what User A is interested in. This means that they would have to be separate index cores, right? What is the best approach for this kind of thing? Thanks in advance, Adam
Re: Using Multiple Cores for Multiple Users
Thanks a lot for all the tips, guys! I think that we may explore both options just to see what happens. I'm sure that scalability will be a huge mess with the core-per-user scenario. I like the idea of creating a user ID field and agree that it's probably the best approach. We'll see...I will be sure to let the list know what I find! Please don't stop posting your comments everyone ;-) My inquiring mind wants to know...

Adam

On Tue, Nov 9, 2010 at 7:34 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

If storing in a single index (possibly sharded if you need it), you can simply include a Solr field that specifies the user ID of the saved thing. On the client side, in your application, simply ensure that there is an fq parameter limiting to the current user, if you want to limit to the current user's stuff. Relevancy ranking should work just as if you had separate cores; there is no relevancy issue.

It IS true that when your index gets very large, commits will start taking longer, which can be a problem. I don't mean commits will take longer just because there is more stuff to commit -- the larger the index, the longer an update to a single document will take to commit.

In general, I suspect that having dozens or hundreds (or thousands!) of cores is not going to scale well; it is not going to make good use of your CPU/RAM/HD resources. Not really the intended use case of multiple cores. However, you are probably going to run into some issues with the single-index approach too. In general, how to deal with multi-tenancy in Solr is an oft-asked question for which there doesn't seem to be any "just works without needing to think about it" solution, judging from past threads. I am not a Solr developer or expert.
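Jonathan's single-index approach - tag every document with its owner and always send an fq filter - can be sketched as a query-builder on the client side. A minimal sketch, assuming a `user_id` field was added at index time (the field name is an assumption; `fq` itself is a standard Solr parameter that filters without affecting relevancy scoring):

```python
from urllib.parse import urlencode

def user_search_params(query, user_id):
    """Build Solr select parameters scoped to one user's documents.

    The application layer adds the fq filter on every request, so a
    user can never see another user's feeds even with an open q.
    """
    return urlencode({
        "q": query,
        "fq": 'user_id:"%s"' % user_id,  # assumed per-document owner field
        "wt": "json",
    })

print(user_search_params("solr faceting", "adam42"))
```

The key property, as noted above, is that fq constrains the result set without entering the relevancy calculation, so scoring behaves as if each user had a private core.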
From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Tuesday, November 09, 2010 6:57 PM
To: solr-user@lucene.apache.org
Cc: Adam Estrada
Subject: Re: Using Multiple Cores for Multiple Users

Hi,

> All, I have a web application that requires the user to register and then login to gain access to the site. Pretty standard stuff...Now I would like to know what the best approach would be to implement a customized search experience for each user. Would this mean creating a separate core per user? I think that this is not possible without restarting Solr after each core is added to the multi-core xml file, right?

No, you can dynamically manage cores and parts of their configuration. Sometimes you must reindex after a change; the same is true for reloading cores. Check the wiki on this one [1].

> My use case is this...User A would like to index 5 RSS feeds and User B would like to index 5 completely different RSS feeds and he is not interested at all in what User A is interested in. This means that they would have to be separate index cores, right?

If you view the documents within an RSS feed as separate documents, you can assign a user ID to those documents, creating a multi-user index with RSS documents per user, or group, or whatever. Having a core per user isn't a good idea if you have many users. It takes up additional memory and disk space, doesn't share caches, etc. There is also more maintenance, and you need some support scripts to dynamically create new cores - Solr currently doesn't create a new core directory structure. But reindexing a very large index takes up a lot more time and resources, and relevancy might be an issue depending on the RSS feeds' contents.

> What is the best approach for this kind of thing?

I'd usually store the feeds in a single index, and shard if it's too many for a single server with your specifications. Unless the demands are too specific.

> Thanks in advance, Adam

[1]: http://wiki.apache.org/solr/CoreAdmin

Cheers