Re: SolrCloud, DIH, and XPathEntityProcessor
On 1/12/2016 6:05 AM, Tom Evans wrote:
> Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having
> some problems with a DIH config that attempts to load an XML file and
> iterate through the nodes in that file; it tries to load the file from
> disk instead of from zookeeper.
>
> <entity dataSource="lookup_conf"
>         rootEntity="false"
>         name="lookups"
>         processor="XPathEntityProcessor"
>         url="lookup_conf.xml"
>         forEach="/lookups/lookup">
>
> The file exists in zookeeper, adjacent to the data_import.conf in the
> lookups_config conf folder.

SolrCloud puts all the *config* for Solr into zookeeper, and adds a new
abstraction for indexes (the collection), but other parts of Solr like
DIH are not really affected. The entity processors in DIH cannot
retrieve data from zookeeper. They do not know how.

Thanks,
Shawn
Re: SolrCloud, DIH, and XPathEntityProcessor
Yeah, that's essentially the nature of open source: someone gets
frustrated enough with the current behavior and fixes it ;)...

There's never any harm in opening a JIRA; all you need to do is
register. It's not a bad idea to open one as you _start_ writing the
code, even providing very early versions of your patch for people to
comment on or to discuss approaches. And early comments may save you a
lot of work! No guarantees of course. If you do put up a preliminary
patch, just mention its current state in the comments.

If you haven't seen it already, here's a primer:
https://wiki.apache.org/solr/HowToContribute

Best,
Erick

On Tue, Jan 12, 2016 at 7:16 AM, Tom Evans wrote:
> On Tue, Jan 12, 2016 at 3:00 PM, Shawn Heisey wrote:
>> On 1/12/2016 7:45 AM, Tom Evans wrote:
>>> That makes no sense whatsoever. DIH loads the data_import.conf from ZK
>>> just fine, or is that provided to DIH from another module that does
>>> know about ZK?
>>
>> This is accomplished indirectly through a resource loader in the
>> SolrCore object that is responsible for config files. Also, the
>> dataimport handler is created by the main Solr code, which then hands
>> the configuration to the dataimport module. DIH itself does not know
>> about zookeeper.
>
> ZkPropertiesWriter seems to know a little...
>
>>> Either way, it is entirely sub-optimal to have SolrCloud store "all"
>>> its configuration in ZK, but still require manually storing and
>>> updating files on specific nodes in order to influence DIH. If a
>>> server is mistakenly not updated, or manually modified locally on
>>> disk, that node would start indexing documents differently than other
>>> replicas, which sounds dangerous and scary!
>>
>> The entity processor you are using accesses files through a Java
>> interface for mounted filesystems. As already mentioned, it does not
>> know about zookeeper.
>>
>>> If there is not a ZkFileDataSource, it shouldn't be too tricky to add
>>> one... I'll see how much I dislike having config files on the host...
>>
>> Creating your own DIH class would be the only solution available
>> right now.
>>
>> I don't know how useful this would be in practice. Without special
>> config in multiple places, Zookeeper limits the size of the files it
>> contains to 1 MB. It is not designed to deal with a large amount of
>> data at once.
>
> This is not a large amount of data; it is a 5 KB XML file containing
> configuration of what tables to query for what fields and how to map
> them into the document.
>
>> You could submit a feature request in Jira, but unless you supply a
>> complete patch that survives the review process, I do not know how
>> likely an implementation would be.
>
> We've already started an implementation, based around FileDataSource
> and using SolrZkClient, which we will deploy as an additional library
> while that process is ongoing or if the patch doesn't survive review.
>
> Cheers
>
> Tom
SolrCloud, DIH, and XPathEntityProcessor
Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having some
problems with a DIH config that attempts to load an XML file and iterate
through the nodes in that file; it tries to load the file from disk
instead of from zookeeper.

<entity dataSource="lookup_conf"
        rootEntity="false"
        name="lookups"
        processor="XPathEntityProcessor"
        url="lookup_conf.xml"
        forEach="/lookups/lookup">

The file exists in zookeeper, adjacent to the data_import.conf in the
lookups_config conf folder. The exception:

2016-01-12 12:59:47.852 ERROR (Thread-44) [c:lookups s:shard1 r:core_node6 x:lookups_shard1_replica2] o.a.s.h.d.DataImporter Full Import failed: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:417)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:481)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:462)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
        ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml)
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:62)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:287)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:202)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
        ... 5 more
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml)
        at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:127)
        at org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:86)
        at org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:48)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:284)
        ... 10 more
Caused by: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml)
        at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:123)
        ... 13 more

Any hints gratefully accepted.

Cheers

Tom
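The bottom of the trace shows FileDataSource resolving the relative `url` against the core's local conf directory on disk, which is why the lookup lands at /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml rather than in zookeeper. A minimal sketch of that resolution behavior (the `resolve` method and its arguments are illustrative; the real logic lives in `FileDataSource.getFile`):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathResolution {
    // Mimics how a relative DIH url is resolved against the core's
    // conf directory on local disk, never against zookeeper.
    static String resolve(String basePath, String url) {
        Path p = Paths.get(url);
        if (p.isAbsolute()) {
            return p.toString();          // absolute urls are used as-is
        }
        return Paths.get(basePath, url).toString();
    }

    public static void main(String[] args) {
        // Under SolrCloud the on-disk conf dir is typically empty, so the
        // resolved path does not exist and DIH fails with
        // FileNotFoundException, as in the trace above.
        System.out.println(resolve(
            "/mnt/solr/server/lookups_shard1_replica2/conf",
            "lookup_conf.xml"));
    }
}
```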
Re: SolrCloud, DIH, and XPathEntityProcessor
On 1/12/2016 7:45 AM, Tom Evans wrote:
> That makes no sense whatsoever. DIH loads the data_import.conf from ZK
> just fine, or is that provided to DIH from another module that does
> know about ZK?

This is accomplished indirectly through a resource loader in the
SolrCore object that is responsible for config files. Also, the
dataimport handler is created by the main Solr code, which then hands
the configuration to the dataimport module. DIH itself does not know
about zookeeper.

> Either way, it is entirely sub-optimal to have SolrCloud store "all"
> its configuration in ZK, but still require manually storing and
> updating files on specific nodes in order to influence DIH. If a
> server is mistakenly not updated, or manually modified locally on
> disk, that node would start indexing documents differently than other
> replicas, which sounds dangerous and scary!

The entity processor you are using accesses files through a Java
interface for mounted filesystems. As already mentioned, it does not
know about zookeeper.

> If there is not a ZkFileDataSource, it shouldn't be too tricky to add
> one... I'll see how much I dislike having config files on the host...

Creating your own DIH class would be the only solution available right
now.

I don't know how useful this would be in practice. Without special
config in multiple places, Zookeeper limits the size of the files it
contains to 1 MB. It is not designed to deal with a large amount of
data at once.

You could submit a feature request in Jira, but unless you supply a
complete patch that survives the review process, I do not know how
likely an implementation would be.

Thanks,
Shawn
Re: SolrCloud, DIH, and XPathEntityProcessor
On Tue, Jan 12, 2016 at 3:00 PM, Shawn Heisey wrote:
> On 1/12/2016 7:45 AM, Tom Evans wrote:
>> That makes no sense whatsoever. DIH loads the data_import.conf from ZK
>> just fine, or is that provided to DIH from another module that does
>> know about ZK?
>
> This is accomplished indirectly through a resource loader in the
> SolrCore object that is responsible for config files. Also, the
> dataimport handler is created by the main Solr code, which then hands
> the configuration to the dataimport module. DIH itself does not know
> about zookeeper.

ZkPropertiesWriter seems to know a little...

>> Either way, it is entirely sub-optimal to have SolrCloud store "all"
>> its configuration in ZK, but still require manually storing and
>> updating files on specific nodes in order to influence DIH. If a
>> server is mistakenly not updated, or manually modified locally on
>> disk, that node would start indexing documents differently than other
>> replicas, which sounds dangerous and scary!
>
> The entity processor you are using accesses files through a Java
> interface for mounted filesystems. As already mentioned, it does not
> know about zookeeper.
>
>> If there is not a ZkFileDataSource, it shouldn't be too tricky to add
>> one... I'll see how much I dislike having config files on the host...
>
> Creating your own DIH class would be the only solution available
> right now.
>
> I don't know how useful this would be in practice. Without special
> config in multiple places, Zookeeper limits the size of the files it
> contains to 1 MB. It is not designed to deal with a large amount of
> data at once.

This is not a large amount of data; it is a 5 KB XML file containing
configuration of what tables to query for what fields and how to map
them into the document.

> You could submit a feature request in Jira, but unless you supply a
> complete patch that survives the review process, I do not know how
> likely an implementation would be.

We've already started an implementation, based around FileDataSource
and using SolrZkClient, which we will deploy as an additional library
while that process is ongoing or if the patch doesn't survive review.

Cheers

Tom
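For readers curious what such a data source might look like: the sketch below mirrors the shape of DIH's `DataSource.getData(query)` contract, but stays self-contained by taking a byte-fetching function as a stand-in for the zookeeper client (in a real patch that function would be something like `path -> zkClient.getData(path, null, null, true)`, and the class would extend `org.apache.solr.handler.dataimport.DataSource<Reader>`). The znode path and class name are illustrative, not the actual implementation discussed in the thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hedged sketch of a ZkFileDataSource: read a DIH config file from a
// znode under the collection's config set instead of from local disk.
public class ZkFileDataSource {
    private final String confPath;                 // e.g. /configs/lookups_config
    private final Function<String, byte[]> fetch;  // stand-in for SolrZkClient

    public ZkFileDataSource(String confPath, Function<String, byte[]> fetch) {
        this.confPath = confPath;
        this.fetch = fetch;
    }

    // Analogue of DataSource.getData(query): resolve the relative url
    // under the conf node and return a Reader over the znode's bytes.
    public Reader getData(String url) {
        String zkPath = confPath + "/" + url;
        byte[] bytes = fetch.apply(zkPath);
        if (bytes == null) {
            throw new RuntimeException("Could not find znode: " + zkPath);
        }
        return new StringReader(new String(bytes, StandardCharsets.UTF_8));
    }

    // Small helper to consume a Reader fully (handy for tests/demos).
    static String drain(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Keeping the zookeeper access behind a function also keeps the class testable without a running ensemble, which matters for a patch meant to survive review.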
Re: SolrCloud, DIH, and XPathEntityProcessor
On Tue, Jan 12, 2016 at 2:32 PM, Shawn Heisey wrote:
> On 1/12/2016 6:05 AM, Tom Evans wrote:
>> Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having
>> some problems with a DIH config that attempts to load an XML file and
>> iterate through the nodes in that file; it tries to load the file from
>> disk instead of from zookeeper.
>>
>> <entity dataSource="lookup_conf"
>>         rootEntity="false"
>>         name="lookups"
>>         processor="XPathEntityProcessor"
>>         url="lookup_conf.xml"
>>         forEach="/lookups/lookup">
>>
>> The file exists in zookeeper, adjacent to the data_import.conf in the
>> lookups_config conf folder.
>
> SolrCloud puts all the *config* for Solr into zookeeper, and adds a new
> abstraction for indexes (the collection), but other parts of Solr like
> DIH are not really affected. The entity processors in DIH cannot
> retrieve data from zookeeper. They do not know how.

That makes no sense whatsoever. DIH loads the data_import.conf from ZK
just fine, or is that provided to DIH from another module that does
know about ZK?

Either way, it is entirely sub-optimal to have SolrCloud store "all"
its configuration in ZK, but still require manually storing and
updating files on specific nodes in order to influence DIH. If a
server is mistakenly not updated, or manually modified locally on
disk, that node would start indexing documents differently than other
replicas, which sounds dangerous and scary!

If there is not a ZkFileDataSource, it shouldn't be too tricky to add
one... I'll see how much I dislike having config files on the host...

Cheers

Tom
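For context on what the entity config above is asking for: forEach="/lookups/lookup" tells XPathEntityProcessor to emit one row per matching node in the file. XPathEntityProcessor actually uses its own streaming XPath subset, but the iteration can be illustrated with plain JDK DOM/XPath (the sample XML below is hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class LookupIteration {
    // Counts nodes matched by the forEach expression, mirroring how
    // XPathEntityProcessor would emit one row per /lookups/lookup node.
    static int countLookups(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("/lookups/lookup", doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical 5 KB-scale lookup_conf.xml content, boiled down.
        String sample = "<lookups><lookup table=\"a\"/><lookup table=\"b\"/></lookups>";
        System.out.println(countLookups(sample));
    }
}
```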