Re: Transform data at index time: country - continent
Hey,

Since you're using Solr and have access to the database in question, did you consider building an extra index on the machine to hold your country-to-continent mapping? I know it's more trouble than it's worth for such a small data set, but hey, you get to set up another index :)
Re: Transform data at index time: country - continent
Hi,

I have thought about synonyms as well. But wouldn't this leave me with a field that contains both the original expression and the continent, e.g. germany, continent-europe? I am not sure whether this might get in the way at some point. On the other hand, it would enable me to have a single search field where the user could search by country or continent.

Interesting. I'll give it some thought.

Thanks
Chris

--
Christian Köhler, Tel.: 0228 9122-433
Zoologisches Forschungsmuseum Alexander Koenig - Leibniz-Institut für Biodiversität der Tiere
Adenauerallee 160, 53113 Bonn, Germany, www.zfmk.de
Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele; Sitz: Bonn
Re: Transform data at index time: country - continent
Hi,

Jack Krupansky wrote:
> One interesting issue: the countries that span continents, such as Turkey, Russia, and some of the former USSR republics. I arbitrarily assigned them a single continent:
> // Note: Turkey is mapped to Asia, and Russia to Europe,
> // Azerbaijan to Asia, Armenia to Asia, Cyprus to Asia,
> // Georgia to Asia, Kazakhstan to Asia,

I came across the same problem. Not to mention the overseas territories of France, the Netherlands, ...

> (I hope I don't get too much hate mail from the Greeks for considering Cyprus to be part of Asia, but it is closer.)

I'd rather assign them to both continents. A false positive is (in my case) better than a miss. My data provides a geo coordinate for each record, which I could use to clarify when in doubt, but this might be another topic.
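Assigning a country to both continents could be sketched roughly as follows for a script update processor. This is only an illustration under several assumptions: the field names `country` and `continent`, the tiny mapping table, and the choice of which countries count as spanning two continents are all made up here, and the `continent` field would have to be declared multiValued in the schema.

```javascript
// Illustrative subset only; which countries span two continents is an assumption.
var CONTINENTS_BY_COUNTRY = {
  "germany": ["europe"],
  "turkey": ["europe", "asia"],
  "russia": ["europe", "asia"]
};

// Returns the list of continents for a country, or an empty list if unknown.
function continentsFor(country) {
  var key = String(country).toLowerCase();
  return CONTINENTS_BY_COUNTRY.hasOwnProperty(key) ? CONTINENTS_BY_COUNTRY[key] : [];
}

// Hook called by the script update processor for each added document.
function processAdd(cmd) {
  var doc = cmd.solrDoc; // the SolrInputDocument being indexed
  var continents = continentsFor(doc.getFieldValue("country"));
  for (var i = 0; i < continents.length; i++) {
    // addField appends a value, so the field ends up with one value per continent.
    doc.addField("continent", continents[i]);
  }
}
```

With a multivalued field, a filter on either continent matches the record, which gives the "false positive rather than a miss" behaviour described above.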
Re: Transform data at index time: country - continent
SynonymFilter may have a keepOrig flag. If so, the following would map countries to continents and not keep the country names:

  <filter class="solr.SynonymFilterFactory" synonyms="continents.txt" keepOrig="false"/>

wunder
--
Walter Underwood
wun...@wunderwood.org
Re: Transform data at index time: country - continent
(I think you're better off with an update processor script, but...)

The synonym filter supports 2.5 modes:

1. Replace mode:
     country => continent
2. Expand mode:
     country, continent
   This results in both terms if either is used.
2.5. The expand=false attribute, which means: treat expand mode as replace mode, with the first term as the replacement.
     continent, country
   would be treated as:
     country, continent => continent

The expand=true attribute is simply the normal expand mode.

Expand mode is really just replace mode with the terms auto-copied to the right side of the =>, so:
     country, continent
is equivalent to:
     country, continent => country, continent

-- Jack Krupansky
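The rule semantics described above can be modelled in a few lines. This is a toy model of the description in this message, not Solr's actual parser; `parseRule` and `splitTerms` are hypothetical helper names introduced only for illustration.

```javascript
// Splits "a, b , c" into ["a", "b", "c"], dropping empty entries.
function splitTerms(text) {
  return text.split(",")
    .map(function (term) { return term.trim(); })
    .filter(function (term) { return term.length > 0; });
}

// Turns one synonyms-file line into { inputs, outputs }.
// expand is only consulted for lines without an explicit "=>".
function parseRule(line, expand) {
  var arrow = line.indexOf("=>");
  if (arrow >= 0) {
    // Replace mode: the left side is replaced by the right side.
    return {
      inputs: splitTerms(line.slice(0, arrow)),
      outputs: splitTerms(line.slice(arrow + 2))
    };
  }
  var terms = splitTerms(line);
  // Expand mode: every term maps to all terms (expand=true),
  // or to the first term only (expand=false).
  return { inputs: terms, outputs: expand ? terms : [terms[0]] };
}
```

For example, `parseRule("continent-europe, germany", false)` yields both terms as inputs and only `continent-europe` as output, which is the "mode 2.5" case.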
Re: Transform data at index time: country - continent
Walter:

Oooh, nice! One could even use a copyField if one wanted to keep them separate...

Erick
Re: Transform data at index time: country - continent
Good point. Copying to a separate field that applies synonyms could help. Filtering out the original countries could be tricky.

The Javadoc mentions a keepOrig flag, but the Solr docs do not. If you could set keepOrig=false, that would do the trick.

wunder
--
Walter Underwood
wun...@wunderwood.org
Re: Transform data at index time: country - continent
On 05.08.2013 at 15:52, Jack Krupansky wrote:
> You can write a brute force JavaScript script using the StatelessScript update processor that hard-codes the mapping.

I'll probably do something like this. Unfortunately I have no influence on the original DB itself, so I have to fix this in Solr.

Cheers
Chris
Re: Transform data at index time: country - continent
Another option might be to use a pre-existing web service... it should be relatively easy to add that to your DataImportHandler configuration (if you're using DIH, that is :-)

A quick Google search gave me http://www.geonames.org; see http://www.geonames.org/export/ for API information.
Re: Transform data at index time: country - continent
Hi,

On 06.08.2013 at 12:56, Raymond Wiker wrote:
> Another option might be to use a pre-existing web service... it should be relatively easy to add that to your DataImportHandler configuration (if you're using DIH, that is :-)
>
> A quick Google search gave me http://www.geonames.org; see http://www.geonames.org/export/ for API information.

Interesting approach, thanks! I'll have to test the performance though. I am indexing millions of records, so the latency of the web service might be an issue.

Cheers
Chris
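One way to soften the latency concern: the data contains only a few hundred distinct countries, so caching the answers would reduce millions of potential remote calls to a few hundred. A minimal sketch, where `fakeRemoteLookup` is a purely hypothetical stand-in for whatever web-service call would be used (it is not a GeoNames API):

```javascript
// Wraps a slow lookup so that each distinct country is resolved only once.
function memoize(lookup) {
  var cache = {};
  return function (country) {
    var key = "k:" + country; // prefix avoids clashes with built-in property names
    if (!cache.hasOwnProperty(key)) {
      cache[key] = lookup(country);
    }
    return cache[key];
  };
}

// Hypothetical stand-in for a remote call; counts how often it is invoked.
var calls = 0;
function fakeRemoteLookup(country) {
  calls++;
  return country === "germany" ? "europe" : "unknown";
}

var cachedLookup = memoize(fakeRemoteLookup);
cachedLookup("germany");
cachedLookup("germany"); // served from the cache, no second remote call
```

The cache here lives only as long as the wrapping function does, so it would need to be created once per import run rather than once per record.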
Re: Transform data at index time: country - continent
Would synonyms help? If you generate the query terms for the continents, you could do something like this:

  usa => continent-na
  canada => continent-na
  germany => continent-europe

and so on.

wunder
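Since a country-to-continent list already exists, the synonym rules could be generated rather than typed by hand. A small sketch, assuming the list is available as a plain object; the function name `toSynonymLines` and the `continent-` prefix convention are taken from or invented for this example only:

```javascript
// Builds one "country => continent-xx" rule per entry, sorted by country name.
function toSynonymLines(continentByCountry) {
  var lines = [];
  Object.keys(continentByCountry).sort().forEach(function (country) {
    lines.push(country + " => continent-" + continentByCountry[country]);
  });
  return lines;
}

var lines = toSynonymLines({ usa: "na", canada: "na", germany: "europe" });
// lines.join("\n") is the text that would be written to the synonyms file.
```

Country names consisting of several words may need extra care, because how they match depends on the tokenizer used in the field's analyzer.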
Re: Transform data at index time: country - continent
I've implemented a JavaScript script for the StatelessScriptUpdate processor that does country code to continent code mapping. It will appear in the next early access release of my Solr 4.x Deep Dive book (on 8/16).

One interesting issue: the countries that span continents, such as Turkey, Russia, and some of the former USSR republics. I arbitrarily assigned them a single continent:

  // Note: Turkey is mapped to Asia, and Russia to Europe,
  // Azerbaijan to Asia, Armenia to Asia, Cyprus to Asia,
  // Georgia to Asia, Kazakhstan to Asia,

(I hope I don't get too much hate mail from the Greeks for considering Cyprus to be part of Asia, but it is closer.)

I suppose continent could be multivalued or maybe a composite string (eu/as or eu+as), but that has an impact on queries. But the script also handles multivalued fields (one value at a time), and nested multivalued fields are not supported.

Thoughts?

-- Jack Krupansky
Re: Transform data at index time: country - continent
Hi,

please excuse the multiple emails to the list. There was a mail server issue; our admin has fixed it (he said ...).

@ list-admin: you may delete my previous duplicate mails (9:59, 10:01 and 10:34) from the list.

Sorry for the noise!
Chris
Re: Transform data at index time: country - continent
Don't know about best practice, but to me, the obvious solution would be to have a database table holding the relationships between countries and continents, and to use a join to get the continent.

On Mon, Aug 5, 2013 at 9:59 AM, Christian Köhler - ZFMK c.koeh...@zfmk.de wrote:
> Hi,
>
> I am indexing data from a MySQL data source. Each record contains the field country. I am looking for a suitable way to create a field continent at indexing time. A list with the country-to-continent information is given.
>
> Writing a script and calling it as a transformer in the SQL query would be my solution of choice right now. A RegexTransformer seems to be less elegant with 200+ different countries. Solr indexes 10 million records, so efficiency should be kept in mind.
>
> With my limited knowledge I might be missing the obvious solution. What would be the best practice? Any thoughts are welcome.
>
> Regards
> Chris
Re: Transform data at index time: country - continent
Hi,

> to have a database table holding the relationships between countries and continents, and using a join to get the continent.

I forgot to mention: I only have read access to the database.

Regards
Chris
Re: Transform data at index time: country - continent
On 8/5/2013 3:02 AM, Christian Köhler - ZFMK wrote:
> I forgot to mention: I only have reading access to the database.

Somebody's got to write something. If you don't have write access to the data, here are some things the DB admin could do:

1) Add a field to the table for the continent. Write a program that goes through the records, figures out the continent, and populates that field for every row. This would cause at least a little bit of DB downtime.

2) Set up the table that Raymond recommended, so you can do a JOIN in your SELECT statement.

3) Use DB server-side code (perhaps a stored procedure?) and give you a database view that uses that code to add a continent field to the results.

It would be very good to have data like the continent in your source database.

If the DB admin can't or won't do any of these things, then you'd have to do it yourself. This likely means one of two things:

1) Write an application to read the data from the database and index the data to Solr. In terms of Solr functionality, the Java API (SolrJ) is the most comprehensive. This would basically be a rewrite of the DataImport handler, but unless it's multi-threaded and written very carefully, it probably won't be as efficient as DIH.

2) Write a custom UpdateProcessor for the Solr server side that does the mapping, and continue using Solr's DataImport handler.

Thanks,
Shawn
Re: Transform data at index time: country - continent
You can write a brute force JavaScript script using the StatelessScript update processor that hard-codes the mapping.

-- Jack Krupansky
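A brute-force script of this kind might look roughly like the sketch below. It assumes the script is registered with the script update processor in solrconfig.xml and that the fields are called `country` and `continent`; the field names, the three mapping entries, and the helper `continentFor` are illustrative assumptions, not taken from anyone's actual configuration.

```javascript
// Hard-coded country -> continent table (illustrative subset; the real list has 200+ entries).
var CONTINENT_BY_COUNTRY = {
  "germany": "europe",
  "usa": "north-america",
  "canada": "north-america"
};

// Returns the continent for a country name, or null when the country is unknown.
function continentFor(country) {
  if (country === null || country === undefined) return null;
  // Normalise: strip surrounding whitespace and ignore case.
  var key = String(country).replace(/^\s+|\s+$/g, "").toLowerCase();
  return CONTINENT_BY_COUNTRY.hasOwnProperty(key) ? CONTINENT_BY_COUNTRY[key] : null;
}

// Hook called by the script update processor once per added document.
function processAdd(cmd) {
  var doc = cmd.solrDoc; // the SolrInputDocument being indexed
  var continent = continentFor(doc.getFieldValue("country"));
  if (continent !== null) {
    doc.setField("continent", continent);
  }
}

// Remaining hooks, defined as no-ops (the sample script shipped with Solr defines them too).
function processDelete(cmd) {}
function processMergeIndexes(cmd) {}
function processCommit(cmd) {}
function processRollback(cmd) {}
function finish() {}
```

Because the lookup is a plain in-memory table, the per-document cost is small, which matters when indexing around 10 million records. Documents with an unknown country are simply indexed without a continent.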