[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642871#comment-16642871 ]

Noble Paul commented on SOLR-12798:
---

I've created a separate issue to track this: SOLR-12843

> Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
>
>                 Key: SOLR-12798
>                 URL: https://issues.apache.org/jira/browse/SOLR-12798
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: SolrJ
>    Affects Versions: 7.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: HOT Balloon Trip_Ultra HD.jpg, SOLR-12798-approach.patch, SOLR-12798-reproducer.patch, SOLR-12798-workaround.patch, SOLR-12798.patch, SOLR-12798.patch, SOLR-12798.patch, no params in url.png, solr-update-request.txt
>
>
> Project ManifoldCF uses SolrJ to post documents to Solr. When upgrading from SolrJ 7.0.x to SolrJ 7.4, we encountered significant structural changes to SolrJ's HttpSolrClient class that seemingly disable any use of multipart post. This is critical because ManifoldCF's documents often contain metadata in excess of 4K that therefore cannot be stuffed into a URL.
> The changes in question seem to have been performed by Noble Paul on 10/31/2017, with the introduction of the RequestWriter mechanism. Basically, if a request has a RequestWriter, it is used exclusively to write the request, and that overrides the stream mechanism completely. I haven't chased it back to a specific ticket.
> ManifoldCF's usage of SolrJ involves the creation of ContentStreamUpdateRequests for all posts meant for Solr Cell, and the creation of UpdateRequests for posts not meant for Solr Cell (as well as for delete and commit requests). For our release cycle that is taking place right now, we're shipping a modified version of HttpSolrClient that ignores the RequestWriter when dealing with ContentStreamUpdateRequests. We apparently cannot use multipart for all requests because when we do, we get "pfountz Should not get here!" errors on the Solr side, which generate HTTP error code 500 responses. That should not happen either, in my opinion.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
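To illustrate why metadata in excess of 4K cannot simply travel in the URL, here is a minimal stand-alone Java sketch. It is not ManifoldCF or SolrJ code. It encodes metadata as `literal.*` query parameters and compares the resulting URL length against an assumed budget. The class name, the 4096-character budget, and the example endpoint URL are assumptions made purely for illustration; real limits depend on the servlet container and on any proxies in between.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class UrlBudgetSketch {
    // Assumed budget for illustration only; not a documented Solr or Jetty constant.
    static final int ASSUMED_URL_BUDGET = 4096;

    /** Encodes metadata as literal.<name>=<value> pairs joined by '&'. */
    static String toQueryString(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode("literal." + e.getKey()))
              .append('=')
              .append(encode(e.getValue()));
        }
        return sb.toString();
    }

    /** URL-encodes a string as UTF-8 (spaces become '+', unsafe bytes become %XX). */
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    /** True if base URL + '?' + query stays within the assumed budget. */
    static boolean fitsInUrl(String baseUrl, String query) {
        return baseUrl.length() + 1 + query.length() <= ASSUMED_URL_BUDGET;
    }

    public static void main(String[] args) {
        // Hypothetical endpoint, used only to have a base URL to measure.
        String base = "http://localhost:8983/solr/collection1/update/extract";
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Quarterly report");
        System.out.println("small metadata fits: " + fitsInUrl(base, toQueryString(metadata)));

        // Simulate a large metadata value, e.g. a long access-control list.
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            big.append('x');
        }
        metadata.put("acl", big.toString());
        System.out.println("large metadata fits: " + fitsInUrl(base, toQueryString(metadata)));
    }
}
```

Once the metadata no longer fits, the parameters have to move into the request body, which is what a multipart post provides.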
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640660#comment-16640660 ]

Noble Paul commented on SOLR-12798:
---

with a test case. Using the same contentstream to post multiple file types
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640511#comment-16640511 ]

Noble Paul commented on SOLR-12798:
---

We never supported binary payloads from solrJ. For this format, we only support binary
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640492#comment-16640492 ]

Lucene/Solr QA commented on SOLR-12798:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 39s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green} 1m 25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | {color:green} 1m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | {color:green} 1m 20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 20m 41s{color} | {color:red} core in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 4m 57s{color} | {color:red} solrj in the patch failed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 30m 59s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | solr.schema.TestManagedSchemaAPI |
| | solr.handler.TestRestoreCore |
| | solr.cloud.LeaderFailureAfterFreshStartTest |
| | solr.TestHighlightDedupGrouping |
| | solr.cloud.TestCloudRecovery |
| | solr.cloud.autoscaling.IndexSizeTriggerTest |
| | solr.cloud.TestPullReplicaErrorHandling |
| | solr.cloud.TriLevelCompositeIdRoutingTest |
| | solr.cloud.api.collections.CustomCollectionTest |
| | solr.handler.component.DistributedFacetPivotSmallTest |
| | solr.schema.TestCloudSchemaless |
| | solr.cloud.autoscaling.MetricTriggerIntegrationTest |
| | solr.core.TestDynamicURP |
| | solr.search.MergeStrategyTest |
| | solr.handler.component.DistributedFacetPivotWhiteBoxTest |
| | solr.cloud.ZkFailoverTest |
| | solr.util.TestSolrCLIRunExample |
| | solr.cloud.TestCloudPseudoReturnFields |
| | solr.cloud.cdcr.CdcrWithNodesRestartsTest |
| | solr.handler.component.DistributedDebugComponentTest |
| | solr.cloud.LIRRollingUpdatesTest |
| | solr.response.transform.TestSubQueryTransformerDistrib |
| | solr.schema.TestBinaryField |
| | solr.handler.component.DistributedFacetPivotLargeTest |
| | solr.cloud.TestDownShardTolerantSearch |
| | solr.cloud.MoveReplicaHDFSFailoverTest |
| | solr.handler.admin.CoreAdminHandlerTest |
| | solr.cloud.api.collections.TestCollectionAPI |
| | solr.TestDistributedMissingSort |
| | solr.cloud.BasicDistributedZk2Test |
| | solr.cloud.TestShortCircuitedRequests |
| | solr.cloud.SolrCloudExampleTest |
| | solr.core.OpenCloseCoreStressTest |
| | solr.core.TestDynamicLoading |
| | solr.cloud.LeaderTragicEventTest |
| | solr.cloud.MigrateRouteKeyTest |
| | solr.handler.component.DistributedExpandComponentTest |
| | solr.TestTolerantSearch |
| | solr.handler.component.TestDistributedStatsComponentCardinality |
| | solr.security.BasicAuthIntegrationTest |
| | solr.handler.TestReplicationHandlerBackup |
| | solr.cloud.TestCloudSearcherWarming |
| | solr.cloud.TestCloudPivotFacet |
| | solr.cloud.RecoveryZkTest |
| | solr.security.hadoop.TestSolrCloudWithHadoopAuthPlugin |
| | solr.handler.component.DistributedFacetPivotLongTailTest |
| | solr.cloud.MissingSegmentRecoveryTest |
| | solr.search.join.TestCloudNestedDocsSort |
| | solr.DistributedIntervalFacetingTest |
| | solr.cloud.TestCryptoKeys |
| | solr.core.BlobRepositoryCloudTest |
| | solr.cloud.DistribDocExpirationUpdateProcessorTest |
| | solr.cloud.TestTolerantUpdateProcessorCloud |
| | solr.search.stats.TestDefaultStatsCache |
| | solr.cloud.cdcr.CdcrVersionReplicationTest |
| | solr.update.SolrIndexSplitterTest |
| | solr.cloud.TestPullReplica |
| | solr.request.TestRemoteStreaming |
| | solr.cloud.TestLeaderElectionWithEmptyReplica |
| | solr.cloud.autoscaling.SystemLogListenerTest |
| | solr.metrics.reporters.solr.SolrCloudReportersTest |
| |
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16635135#comment-16635135 ]

Mikhail Khludnev commented on SOLR-12798:
-

[~noble.paul], how do you propose to pass binary payloads in JSON? I've found one SO thread discussing it, and it doesn't look very promising.
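For readers wondering what "binary payloads in JSON" would even look like: JSON has no native binary type, so the usual workaround is to Base64-encode the bytes into a string. The stand-alone Java sketch below shows only that encoding step. It is an illustration, not something Solr is confirmed to accept and not an approach anyone in this thread has proposed; the class name and the `content_b64` field name are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64JsonSketch {
    /**
     * Wraps a binary payload in a tiny JSON object.
     * Assumes the id needs no JSON escaping; a real client would use a JSON library.
     * "content_b64" is a hypothetical field name chosen for this sketch.
     */
    static String toJson(String id, byte[] payload) {
        // The Base64 alphabet (A-Z, a-z, 0-9, '+', '/', '=') needs no JSON escaping.
        String encoded = Base64.getEncoder().encodeToString(payload);
        return "{\"id\":\"" + id + "\",\"content_b64\":\"" + encoded + "\"}";
    }

    /** Reverses the encoding on the receiving side. */
    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        byte[] payload = "binary-ish content".getBytes(StandardCharsets.UTF_8);
        String json = toJson("doc1", payload);
        System.out.println(json);
        // Every 3 input bytes become 4 output characters, so the payload grows by about a third.
        System.out.println("raw bytes: " + payload.length
                + ", encoded chars: " + Base64.getEncoder().encodeToString(payload).length());
    }
}
```

The size overhead, plus the need to hold the encoded string in memory, is one reason this route is unattractive for large documents compared with sending the raw bytes as their own part of the request.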
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634537#comment-16634537 ]

Mikhail Khludnev commented on SOLR-12798:
-

Is there a Solr Cell ticket to move from those fancy {{literal.foo}} params to the Solr docs/fields format?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633256#comment-16633256 ]

Karl Wright commented on SOLR-12798:


[~elyograg]
{quote}
I would suggest that you don't do this. At all. Tika is prone to OOM and JVM crashes, as Julien Massiera already noted.
{quote}
It's not a very good citizen running inside ManifoldCF either. We have the ability to use the external service version, but really that just offshores the problem. But I agree it's better to keep user-facing services alive if one can.

For backwards compatibility reasons, we will need to continue to support this mode of operation, but we'll recommend against it, and change our defaults accordingly as well.

FWIW, we've been steadily pushing tickets into the Tika queue and issues are getting addressed. That's really the best long-term solution.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633252#comment-16633252 ]

Karl Wright commented on SOLR-12798:


[~mkhludnev] Ugly hack has been voted on and shipped. Hopefully by next round (December) there's a better way though.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633241#comment-16633241 ]

Mikhail Khludnev commented on SOLR-12798:
-

After thinking about it for some time, I'd agree with [~noble.paul]. Params are supposed to be meta info, which is supposed to be short and more static, whereas the payload is big and changes every time. ManifoldCF's current approach (even if we fix multipart) doesn't allow passing many docs, each with its own params attached. IMHO that proves a design flaw. [~daddywri], can you go ahead with the ugly workaround and later migrate to passing metadata via fields?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633140#comment-16633140 ]

Lucene/Solr QA commented on SOLR-12798:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 9s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green} 2m 15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | {color:green} 2m 15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | {color:green} 2m 15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 55s{color} | {color:green} solrj in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 16m 33s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-12798 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12941838/SOLR-12798.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 964cc88 |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | 1.8.0_172 |
| Test Results | https://builds.apache.org/job/PreCommit-SOLR-Build/194/testReport/ |
| modules | C: solr/solrj U: solr/solrj |
| Console output | https://builds.apache.org/job/PreCommit-SOLR-Build/194/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632114#comment-16632114 ]

Jan Høydahl commented on SOLR-12798:


bq. should be what SolrJ *always* does when it's asked to do POST, so URL limits aren't exceeded no matter what gets thrown at it.

Sounds like a good way to ensure robustness. Also, the /admin/metrics bug that recently surfaced would benefit if we preferred POST over GET in more places generally.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631860#comment-16631860 ]

Shawn Heisey commented on SOLR-12798:

bq. How do you suggest we handle binary data that is meant for SolrCell?

I would suggest that you don't do this. At all. Tika is prone to OOM and JVM crashes, as [~julienFL] already noted. When this happens in SolrCell, Solr goes down too. So it's strongly recommended that users never run SolrCell in production, which in my opinion means that MCF should not be using SolrCell. Tika should be separate, so that if it explodes, the Solr server keeps running.

That said... I think support for multi-part POST should be first class in SolrJ, and I would even say that sending separate parts for parameters and the actual body should be what SolrJ *always* does when it's asked to do POST, so URL limits aren't exceeded no matter what gets thrown at it. And we need to make sure that multi-part handling on the server side is rock-solid. (I'm not suggesting there are any problems there ... but if any are found, they need attention.)

It's probably a good idea to support multiple *data* streams in SolrJ as well. This would probably require some changes on the server side, and a separate Jira issue.

If MCF creates SolrInputDocument objects, it can put everything there. MCF wouldn't need to be concerned about format (the JSON mentioned earlier), only one POST part is required, URL parameters are not needed, and the standard /update handler can be used, even without a change for this issue.
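To make the "separate parts for parameters and the actual body" idea concrete, here is a minimal sketch that assembles a multipart/form-data body by hand using only the JDK. It is not SolrJ code: the class name `MultipartBodySketch`, its `build` method, the boundary string, and the `literal.*` parameter names in `main` are illustrative assumptions, and the sketch assumes parameter names are plain ASCII without quotes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of SolrJ: builds a multipart/form-data body by hand.
final class MultipartBodySketch {

    // Every request parameter becomes its own part, followed by one binary part,
    // so nothing has to be appended to the URL.
    static byte[] build(String boundary, Map<String, String> params,
                        String streamName, String contentType, byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> e : params.entrySet()) {
            put(out, "--" + boundary + "\r\n");
            put(out, "Content-Disposition: form-data; name=\"" + e.getKey() + "\"\r\n");
            put(out, "Content-Type: text/plain; charset=UTF-8\r\n\r\n");
            put(out, e.getValue().getBytes(StandardCharsets.UTF_8));
            put(out, "\r\n");
        }
        // The binary document goes last, as a single named part.
        put(out, "--" + boundary + "\r\n");
        put(out, "Content-Disposition: form-data; name=\"" + streamName
                + "\"; filename=\"" + streamName + "\"\r\n");
        put(out, "Content-Type: " + contentType + "\r\n\r\n");
        put(out, payload);
        put(out, "\r\n--" + boundary + "--\r\n"); // closing delimiter
        return out.toByteArray();
    }

    private static void put(ByteArrayOutputStream out, String headerText) {
        put(out, headerText.getBytes(StandardCharsets.US_ASCII));
    }

    private static void put(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length); // this overload declares no checked exception
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "doc1");                   // illustrative parameter names
        params.put("literal.acl", "user1,user2,group-a");   // could be far larger than 4K
        byte[] body = build("sketch-boundary", params, "file", "application/pdf",
                new byte[] {0x25, 0x50, 0x44, 0x46});
        System.out.println("multipart body is " + body.length + " bytes");
    }
}
```

With this layout the size of the metadata affects only the request body, so the URL stays the same length no matter how large the ACL list grows.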
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631644#comment-16631644 ]

Mikhail Khludnev commented on SOLR-12798:

[~noble.paul] I'm not sure why we need to revamp the handlers that expect a content stream so that they can read doc fields. The other concern is that a request with a single content stream currently works like a minefield: when someone adds params that are too long, it blows up unexpectedly. Always stripping params from POST URLs would make it far more predictable for users.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631608#comment-16631608 ]

Julien Massiera commented on SOLR-12798:

Assume I have a PDF file that contains an image that can be OCRed. I have a process that sends the PDF to a Tika server, which extracts the metadata of the PDF file plus the text from the image, thanks to Tesseract. At the end of the Tika job, the process retrieves two elements: a list of metadata as an ArrayList, and a file containing the text extracted from the image inside the PDF file.

Now, I add the ACLs of the PDF file (which are huge) to the metadata list, and I need the metadata and the file to be sent to Solr as one document for indexing.

What are your recommendations, in terms of code, for doing this in the most efficient way (in terms of memory consumption and performance, of course) using SolrJ? And which handler would you use on the Solr side?

I will test it and see if I hit the URL limit issue.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631581#comment-16631581 ]

Jan Høydahl commented on SOLR-12798:

{quote}here you can see how ManifoldCF accomplish content stream blob with long params
{quote}

This code is for posting to ExtractingHandler, and contains a limited amount of literal metadata, unless the ACLs are huge, which I suppose they may very well be. That alone would warrant the multi-part requirement.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631565#comment-16631565 ]

Noble Paul commented on SOLR-12798:

[~mkhludnev] we will post the data as follows:

{code:java}
{
  "docs": [
    {"params": {"a": "b", "c": "d"}, "payload": ""},
    {"params": {"p": "q", "r": "s"}, "payload": ""}
  ]
}
{code}

On the server side, we unmarshal the params first and then read the payload stream.
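The "params first, then the payload stream" ordering described above can be illustrated with a small sketch. Note that it swaps in a simple length-prefixed framing rather than the JSON envelope from the comment, purely to keep the example dependency-free; the class `ParamsThenPayloadSketch` and its methods are hypothetical and do not correspond to anything in Solr or SolrJ.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical framing, not Solr's wire format: [4-byte length][params bytes][payload bytes].
final class ParamsThenPayloadSketch {

    // Client side: put the (small) params block in front of the (large) payload.
    static byte[] frame(String paramsJson, byte[] payload) {
        byte[] params = paramsJson.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + params.length + payload.length);
        buf.putInt(params.length).put(params).put(payload); // big-endian length prefix
        return buf.array();
    }

    // Server side: consume only the params block; the stream is left positioned
    // at the first payload byte so the payload can be read incrementally afterwards.
    static String readParams(InputStream in) {
        try {
            DataInputStream data = new DataInputStream(in);
            byte[] params = new byte[data.readInt()];
            data.readFully(params);
            return new String(params, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = frame("{\"a\":\"b\",\"c\":\"d\"}", new byte[] {1, 2, 3, 4});
        ByteArrayInputStream in = new ByteArrayInputStream(body);
        System.out.println("params: " + readParams(in));
        System.out.println("payload bytes still unread: " + in.available());
    }
}
```

The point of the ordering is that the receiver never has to buffer the payload in order to learn the parameters, whatever concrete encoding ends up being used.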
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631548#comment-16631548 ]

Mikhail Khludnev commented on SOLR-12798:

[~janhoy], here you can see how ManifoldCF sends a content stream blob together with long params:
https://github.com/apache/manifoldcf/blob/11f8021c22c7fc141d237970b713b197992b5921/connectors/solr/connector/src/main/java/org/apache/manifoldcf/agents/output/solr/HttpPoster.java#L1224

You can see the particular params that are attached. I'm just answering your question literally, regardless of my (lack of) understanding of, or opinion on, this design.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631520#comment-16631520 ]

Jan Høydahl commented on SOLR-12798:

{quote}How we can avoid multipart when we have one big chunk as a content stream and one chunk with huge params?
{quote}

I still have not seen the use case for this. Why would there be huge params when you post a binary content stream to SolrCell? Wouldn't the params come from the metadata inside the binary docs, which are unpacked on the Solr server side? You could of course have large metadata about a PDF sitting in a database on the client and want to post that along with the binary doc, but as I understand the use case for MCF, the huge metadata is parsed from the binary doc by Tika on the server side?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631514#comment-16631514 ]

Mikhail Khludnev commented on SOLR-12798:

[~noble.paul] not sure I follow. How can we avoid multipart when we have one big chunk as a content stream and one chunk with huge params?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631502#comment-16631502 ]

Noble Paul commented on SOLR-12798:

The ideal fix is to avoid multipart altogether because we support both ends of the communication.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631430#comment-16631430 ]

Mikhail Khludnev commented on SOLR-12798:

OK. It turns out that to trigger passing params as part of a multipart request, one needs to pass at least _two_ _named_ streams. See [^SOLR-12798-workaround.patch].

[~kwri...@metacarta.com] would you mind evaluating a quick workaround? After the binary payload is added as a content stream, can it also add a named, empty stream, as in the snippet below?

{code}
// Second named stream with empty content; its only purpose is to trigger multipart.
up.addContentStream(new ContentStreamBase.StringStream("") {
  {
    setName("multipart trigger. SOLR-12798");
  }
});
{code}

Regarding a more appropriate fix: should we always pass params as multipart with POST, or try to estimate their size and do so only when they are long?
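For the second option raised above (estimating parameter size and moving params into the body only when they are long), a minimal JDK-only sketch might look like the following. The class `ParamPlacementSketch`, its methods, and the 4096-character budget are assumptions made for illustration; the budget simply echoes the "4K" figure in the issue description and is not a documented Solr or servlet-container limit.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of SolrJ.
final class ParamPlacementSketch {

    // Assumed budget, taken from the "4K" figure in the issue description.
    static final int MAX_QUERY_CHARS = 4096;

    // Length of "k1=v1&k2=v2..." after URL-encoding
    // (the Charset overload of URLEncoder.encode needs Java 10+).
    static int encodedLength(Map<String, String> params) {
        int length = 0;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (length > 0) {
                length++; // '&' separator between pairs
            }
            length += URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8).length();
            length++; // '=' between key and value
            length += URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8).length();
        }
        return length;
    }

    // True if the params can safely stay in the query string under the assumed budget.
    static boolean fitsInUrl(Map<String, String> params) {
        return encodedLength(params) <= MAX_QUERY_CHARS;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "doc1");             // illustrative parameter names
        params.put("literal.acl", "x".repeat(5000));  // String.repeat needs Java 11+
        System.out.println(encodedLength(params) + " encoded chars, fits in URL: "
                + fitsInUrl(params));
    }
}
```

The trade-off is the one debated in this thread: a size check keeps short requests unchanged, but behaviour then depends on the data, whereas always sending params in the body is more predictable.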
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631337#comment-16631337 ]

Noble Paul commented on SOLR-12798:

An ideal solution would be:
* Be able to construct a SolrInputDocument with a binary payload + metadata parameters for that doc
* When this is sent to Solr, SolrJ should send the payload + parameters in the body
* This ensures that the query string length is always constant
* This also helps in inter-node communication where the documents are sent between replicas

I'm not sure if we can achieve this without some changes on the server side too. Meanwhile, we may need a custom HttpSolrClient implementation that can do a multipart request.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631211#comment-16631211 ]

Michael Schumann commented on SOLR-12798:

I wanted to chime in here because we have run into the same problem, the body of large POST requests getting encoded in the URL, in a different scenario, and it would be nice if there were a solution for it. To work around the problem we have had to copy and modify Solr classes.

Our use case is not a common one: we sometimes make query requests to a custom handler with a very large number of integer values encoded into a RoaringBitMap. On the client side it is not a big problem: we created a subclass of {{HttpSolrClient.Builder}} that sets {{UseMultiPartPost}} to true. This is passed in to the {{LBHttpSolrClient}}, which in turn is passed into {{CloudSolrClient}}.

The harder problem was in the {{HttpShardHandler}} on the Solr nodes, which ends up encoding the parameters in the URL. The workaround we came up with was to duplicate and modify {{HttpShardHandler}} so we could again set {{UseMultiPartPost}} to true. We also had to subclass {{HttpShardHandlerFactory}} and {{HttpSolrClient.Builder}}.

It would be great if there were a way to force multipart requests both on the SolrJ client side and in the requests made between the nodes.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631149#comment-16631149 ]

Karl Wright commented on SOLR-12798:

[~janhoy]:

{quote}
That would be for case 1) where you don't do Tika stuff on the MCF side but want Solr to handle the binary stream. In this case there should be no problem with huge metadata request params. And I agree that SolrJ should support this case (ContentStreamUpdateRequest?).
{quote}

OK. At the moment that sort of request seems to be transmitted with a standard POST, with the metadata stuffed into the URL, so a fix is needed for that.

{quote}
I got confused by your other use case where you parse the file with Tika on the MCF side and still send the text to /extract
{quote}

While Julien has a custom Solr handler, that's not what we typically do; we recommend that already-Tika-extracted content and metadata be sent to the /update handler. In that case, we build a SolrInputDocument from the content stream and add it to an UpdateRequest. This mode of usage also seems to use a standard POST, or even PUT, and it puts all the metadata parameters on the URL. This is transmitted to the /update handler.

Do you want to support the case where the metadata parameters are sizable enough that the URL exceeds 8192 bytes?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631127#comment-16631127 ]

Julien Massiera commented on SOLR-12798:

[~janhoy],

Your proposal could resolve the long URL problem, but how would you create a Solr document (XML, JSON or CSV because, if I am not mistaken, these are the only three formats that Solr's update handler can manage) from some metadata and a content file (which in my case is pure text) without having to read the entire content file to inject it into the Solr document? I think it would have a huge performance impact when one has to crawl millions, if not billions, of documents.

The Solr output connector of MCF currently just constructs a simple POST request with the document metadata as parameters and the content file as a stream. Your solution would add a significant step before sending the document. Am I wrong?
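On the question of building an update document without first reading the whole content file: one possible approach is to write the JSON while copying the content through a fixed-size buffer, so that memory use stays constant regardless of file size. The following is a minimal sketch using only the JDK. It is not ManifoldCF or SolrJ code; the class and method names (StreamingJsonDocSketch, writeDoc, toJson) and the field name "content" are hypothetical, and it does not measure the per-document cost raised in the comment.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of SolrJ or ManifoldCF.
final class StreamingJsonDocSketch {

  // Writes one character using the escaping that JSON strings require.
  private static void writeEscaped(Writer out, char c) throws IOException {
    switch (c) {
      case '"':  out.write("\\\""); break;
      case '\\': out.write("\\\\"); break;
      case '\n': out.write("\\n"); break;
      case '\r': out.write("\\r"); break;
      case '\t': out.write("\\t"); break;
      default:
        if (c < 0x20) {
          out.write(String.format("\\u%04x", (int) c));
        } else {
          out.write(c);
        }
    }
  }

  // Copies the reader into a JSON string literal through a fixed-size buffer.
  private static void writeString(Writer out, Reader in) throws IOException {
    out.write('"');
    char[] buf = new char[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
      for (int i = 0; i < n; i++) {
        writeEscaped(out, buf[i]);
      }
    }
    out.write('"');
  }

  // Writes a one-document JSON array: metadata fields first, streamed content last.
  static void writeDoc(Writer out, Map<String, String> metadata,
                       String contentField, Reader content) throws IOException {
    out.write("[{");
    for (Map.Entry<String, String> e : metadata.entrySet()) {
      writeString(out, new StringReader(e.getKey()));
      out.write(':');
      writeString(out, new StringReader(e.getValue()));
      out.write(',');
    }
    writeString(out, new StringReader(contentField));
    out.write(':');
    writeString(out, content);
    out.write("}]");
  }

  // Convenience wrapper for small inputs; real use would write to the request body stream.
  static String toJson(Map<String, String> metadata, String contentField, String content) {
    StringWriter sw = new StringWriter();
    try {
      writeDoc(sw, metadata, contentField, new StringReader(content));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sw.toString();
  }

  public static void main(String[] args) {
    Map<String, String> metadata = new LinkedHashMap<>();
    metadata.put("id", "1");
    System.out.println(toJson(metadata, "content", "plain text body"));
  }
}
```

The content is still read once, as it would be for any upload; what the sketch avoids is holding the whole file in memory before sending.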
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631107#comment-16631107 ]

Jan Høydahl commented on SOLR-12798:

{quote}How do you suggest we handle binary data that is meant for SolrCell?{quote}

That would be for case 1), where you don't do Tika stuff on the MCF side but want Solr to handle the binary stream. In this case there should be no problem with huge metadata request params. And I agree that SolrJ should support this case ({{ContentStreamUpdateRequest}}?).

I got confused by your other use case, where you parse the file with Tika on the MCF side and still send the text to /extract.

As I understand it, this Jira issue is mainly about the classic use case where you do NOT invoke Tika on the client side but stream binary content to SolrCell and still need some URL parameters, and doing this in SolrJ is broken somehow. In this case there will NOT be huge metadata to pass as URL parameters, right?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631084#comment-16631084 ]

Karl Wright commented on SOLR-12798:

[~janhoy], if you didn't mean that the metadata and content should be sent in the content body, then I'm completely missing what your suggestion is.

{quote}
My cURL examples were just to discuss what "metadata" might mean in this context.
{quote}

Repositories that are crawled by ManifoldCF have documents that are represented as follows:

- A long content stream, binary
- N pairs of name/value data, called metadata, which is fielded data associated with the document

If the metadata is extracted in a ManifoldCF pipeline from the content stream, it's done via Tika, from a binary stream, which changes the binary content stream to a simple text stream and also supplies more metadata generated as a result of the extraction.

In other words, your JSON example is not like anything we do at this time. If you want this translated into cURL, you can do it in one of two ways:

(1) Put the metadata onto the URL as & parameters, e.g. name1=value1&name2=value2 etc., or
(2) Send the metadata as sections in a multipart post. This too can be set up in cURL if you want me to propose an example.

Each section in a multipart post has a name, so you can transmit a section for every metadata name/value pair, as well as one for the content part (which has its own name, which is in fact used by SolrCell for metadata of its own).

Hope this helps.
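The multipart option (2) can be made concrete by looking at what such a request body contains: one named section per metadata pair, plus one for the content. The following is a minimal sketch that assembles a multipart/form-data body by hand with the JDK only. It is not how SolrJ builds its requests; the class name MultipartSketch and the sample field names are hypothetical, and it assumes field names contain no quotes or line breaks.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of SolrJ.
final class MultipartSketch {

  // Appends text to the body as UTF-8 bytes.
  private static void write(ByteArrayOutputStream out, String s) {
    byte[] b = s.getBytes(StandardCharsets.UTF_8);
    out.write(b, 0, b.length);
  }

  // Builds a body with one section per metadata pair and a final binary content section.
  static byte[] build(String boundary, Map<String, String> metadata,
                      String contentName, byte[] content) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (Map.Entry<String, String> e : metadata.entrySet()) {
      write(out, "--" + boundary + "\r\n");
      write(out, "Content-Disposition: form-data; name=\"" + e.getKey() + "\"\r\n\r\n");
      write(out, e.getValue());
      write(out, "\r\n");
    }
    write(out, "--" + boundary + "\r\n");
    write(out, "Content-Disposition: form-data; name=\"" + contentName
        + "\"; filename=\"" + contentName + "\"\r\n");
    write(out, "Content-Type: application/octet-stream\r\n\r\n");
    out.write(content, 0, content.length);
    // The closing delimiter carries two trailing dashes.
    write(out, "\r\n--" + boundary + "--\r\n");
    return out.toByteArray();
  }

  public static void main(String[] args) {
    Map<String, String> metadata = new LinkedHashMap<>();
    metadata.put("name1", "value1");
    metadata.put("name2", "value2");
    byte[] body = build("sketch-boundary", metadata, "myfile",
        "file bytes".getBytes(StandardCharsets.UTF_8));
    System.out.println(new String(body, StandardCharsets.UTF_8));
  }
}
```

Because every name/value pair travels in the body, the URL length does not depend on how much metadata a document carries.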
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631066#comment-16631066 ]

Jan Høydahl commented on SOLR-12798:

{quote}so your suggestion is to use JSON format{quote}

Not at all. My cURL examples were just to discuss what "metadata" might mean in this context. In a pure type-2) case, where Tika runs in MCF, one would construct documents with all metadata as fields in those documents. So I still don't understand why or how you'd get those long URLs at all in this scenario, since all the content goes into the streamed body.

But I have not tested this streaming use of SolrJ myself; I have just built in-memory SolrInputDocuments as usual, and I understand that you want to be memory efficient here and stream those docs as far as possible.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631059#comment-16631059 ]

Karl Wright commented on SOLR-12798:

[~janhoy], so your suggestion is to use JSON format for the body, and put the metadata into that. How do you suggest we handle binary data that is meant for SolrCell? Encoding the binary in a JSON document is possible, but in practice it is quite verbose, yielding 3 or 4 bytes to one. Is that nevertheless your official suggestion?
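For scale, the cost of one common way of carrying binary data in JSON, Base64, can be measured directly; other encodings, such as per-byte escapes, expand the data more. The following is a minimal sketch assuming Base64 is the encoding used; the class name Base64OverheadDemo is hypothetical.

```java
import java.util.Base64;

// Hypothetical helper for illustration only.
final class Base64OverheadDemo {

  // Returns the number of bytes in the Base64 text for rawBytes bytes of input.
  static int encodedLength(int rawBytes) {
    return Base64.getEncoder().encode(new byte[rawBytes]).length;
  }

  public static void main(String[] args) {
    int raw = 3000;
    System.out.println(raw + " raw bytes -> " + encodedLength(raw) + " encoded bytes");
  }
}
```

Base64 emits four output bytes for every three input bytes, with padding up to the next multiple of four.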
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631057#comment-16631057 ]

Karl Wright commented on SOLR-12798:

[~mkhludnev], your walkthrough of the code is fine, but (a) when we use ContentStreamUpdateHandler in the manner you describe with the update/extract handler, we still wind up going through the contentWriter clause above where you stop, and (b) when we use UpdateHandler in the manner you describe, we also go through that same path. In fact, I could find no way to send the content through any other path with the code as it exists in master right now, because in our usage there is always a contentWriter, and the check for its presence excludes everything that happens after it.

So I don't understand where the disconnect is. Perhaps if you attach the exact code you are testing we can resolve this.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631041#comment-16631041 ]

Mikhail Khludnev commented on SOLR-12798:

Karl, fwiw {{SolrExampleTests.testMultiContentStreamRequest()}} bypasses the code path you pointed me to. I still don't fully understand: why not pass everything it needs via {{ContentStreamUpdateRequest.addFile()}} and {{.setParam()}} instead of {{ContentWriter}}? I've checked that long {{wparams}} are encoded and passed as a separate part, keeping the URL short.

!no params in url.png!
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630999#comment-16630999 ]

Jan Høydahl commented on SOLR-12798:

If by "metadata" you mean the {{literal.<field>=value}} HTTP parameters that the ExtractingRequestHandler expects, then why would you send those on a normal update request containing a SolrInputDocument with all fields embedded? I.e. instead of this (which does not even make sense, since the JSON update handler does not support the literal param)

{code}
curl -XPOST 'http://localhost:8983/solr/foo/update?literal.id=1&literal.author=George&literal.title=Hello'
{code}

you post all metadata as fields in the body:

{code}
curl -XPOST http://localhost:8983/solr/foo/update -H "Content-type: application/json" -d '[{"id":"1", "author":"George", "title":"Hello"}]'
{code}
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630851#comment-16630851 ]

Karl Wright commented on SOLR-12798:

Please examine the following code from master HttpSolrClient.java:

{code}
if (contentWriter != null) {
  String fullQueryUrl = url + wparams.toQueryString();
  HttpEntityEnclosingRequestBase postOrPut = SolrRequest.METHOD.POST == request.getMethod() ?
      new HttpPost(fullQueryUrl) : new HttpPut(fullQueryUrl);
  postOrPut.addHeader("Content-Type", contentWriter.getContentType());
  postOrPut.setEntity(new BasicHttpEntity() {
    @Override
    public boolean isStreaming() {
      return true;
    }

    @Override
    public void writeTo(OutputStream outstream) throws IOException {
      contentWriter.write(outstream);
    }
  });
  return postOrPut;
} else if (streams == null || isMultipart) {
{code}

The request is formed by taking all the parameters in wparams (which include the metadata fields AFAICT) and putting them into the URL:

{code}
HttpEntityEnclosingRequestBase postOrPut = SolrRequest.METHOD.POST == request.getMethod() ?
    new HttpPost(fullQueryUrl) : new HttpPut(fullQueryUrl);
{code}

There is no other way in the SolrJ request handling code for PUT and POST requests to transmit metadata to Solr. Indeed, right now, both documents added to an UpdateRequest and documents that are specified via ContentStreamUpdateRequest go by this route.

We did verify that using the 7.5.0 version of SolrJ, with all ManifoldCF custom code completely removed, led to requests that would exceed the maximum URL length if the document metadata was long enough.
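The effect of {{url + wparams.toQueryString()}} on URL length can be sketched without SolrJ. The following JDK-only model is an illustration, not the SolrJ implementation: {{UrlLengthCheck}} and its methods are hypothetical names, and the 8192 limit is the figure quoted elsewhere in this thread, so the real limit depends on how the server is configured.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

/** Hypothetical model (not SolrJ) of folding every request parameter into the URL. */
final class UrlLengthCheck {

  /** Assumed limit, taken from the figure quoted in the thread. */
  static final int ASSUMED_MAX_URL_LENGTH = 8192;

  /** URL-encodes one key or value as UTF-8. */
  static String encode(String s) {
    try {
      return URLEncoder.encode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is guaranteed to be available on every JVM.
      throw new IllegalStateException(e);
    }
  }

  /** Renders the parameters as "?k1=v1&k2=v2", or an empty string when there are none. */
  static String toQueryString(Map<String, String> params) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : params.entrySet()) {
      sb.append(sb.length() == 0 ? '?' : '&');
      sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
    }
    return sb.toString();
  }

  /** True when the base URL plus the encoded parameters is longer than the assumed limit. */
  static boolean exceedsLimit(String baseUrl, Map<String, String> params) {
    // The encoded URL is pure ASCII, so its character count equals its byte count.
    return (baseUrl + toQueryString(params)).length() > ASSUMED_MAX_URL_LENGTH;
  }
}
```

A check like this could run before a request is sent, to decide whether the parameters have to travel in the request body instead.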
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630625#comment-16630625 ]

Mikhail Khludnev commented on SOLR-12798:

I'm trying to understand what the problem is, given that the challenge is to send a huge file in the body together with a long param. I took the test [https://github.com/apache/lucene-solr/blob/c587410f99375005c680ece5e24a4dfd40d8d3eb/solr/solrj/src/test/org/apache/solr/client/solrj/SolrExampleTests.java#L675] and added a long param to it:

{code:java}
up.setParam(CommonParams.HEADER_ECHO_PARAMS, CommonParams.EchoParamStyle.ALL.toString());
{ // added long param
  StringBuilder sb = new StringBuilder();
  for (int i = 0; i < 1000; i++) {
    sb.append((char) ('a' + ((char) (i % 26))));
  }
  String longparam = sb.toString();
  //System.out.println(longparam.length());
  up.setParam("b", longparam);
}
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
{code}

Then I ran SolrExampleJettyTest and it passed. Is it possible for ManifoldCF to make its requests the same way?
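For experimenting with other parameter sizes, the long-parameter loop from the test modification above can be pulled out into a standalone helper. This is a minimal sketch with a hypothetical class name ({{LongParam}}); the length is a parameter so that values larger than the 1000 characters used above can be tried, for example sizes above the 8192-byte URL length mentioned elsewhere in this thread.

```java
/** Hypothetical helper: builds a test parameter value of arbitrary length. */
final class LongParam {

  /** Returns a string of the requested length that cycles through 'a'..'z'. */
  static String build(int length) {
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      sb.append((char) ('a' + (i % 26)));
    }
    return sb.toString();
  }
}
```

In the test above, {{up.setParam("b", LongParam.build(1000))}} would reproduce the original modification.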
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630605#comment-16630605 ]

Julien Massiera commented on SOLR-12798:

[~janhoy], considering the discussion thread, I don't think that having us send you what we do will convince you that we do it the proper way. I think it would be more helpful for us if you showed us the SolrJ code that you envision for creating a Solr document with some content and some metadata, and streaming it to Solr via the POST method.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630547#comment-16630547 ]

Jan Høydahl commented on SOLR-12798:

{quote}you will note that all parameters and metadata are folded into the URL for the ContentWriter transmission mechanism {quote}

I don't get it. What parameters and metadata are we talking about here, that you wish to send to Solr's standard {{/update}} handler? All the document fields and metadata would go in the POST body, wouldn't they?

Please give an example of this type 2) request. It does not need to be an example with a large request, just any request using MCF's Tika component, showing what things look like when attempting to POST that content to Solr's {{/update}} endpoint.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630531#comment-16630531 ]

Karl Wright commented on SOLR-12798:

{quote} This looks to me like a plain Solr document post to /update handler, in whatever format you'd like? If you can take advantage of Noble Paul's enhancements to stream the content this can still be a plain document not needing multipart, and no need sending data in http params? {quote}

The streaming part is great. But if you look at the current master implementation of HttpSolrClient, you will note that all parameters and metadata are folded into the URL for the ContentWriter transmission mechanism. This fails for us because the URL size can easily exceed 8192 bytes. That is why we need the multipart post handling even for UpdateRequest/SolrInputDocument requests.
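For readers unfamiliar with why multipart post sidesteps the URL limit: in a multipart/form-data request each parameter becomes a part of the request body, so the URL itself stays short. The sketch below shows that wire format using only the JDK. It illustrates the encoding in general and is not SolrJ's implementation; {{MultipartParams}} is a hypothetical name, and the code assumes that parameter names contain no double quotes and that the boundary string never occurs inside a value.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical helper (not SolrJ): encodes parameters as multipart/form-data body parts. */
final class MultipartParams {

  /** Value for the Content-Type request header. */
  static String contentType(String boundary) {
    return "multipart/form-data; boundary=" + boundary;
  }

  /**
   * Encodes each parameter as one text part.
   * Assumes names contain no double quotes and the boundary never occurs in a value.
   */
  static byte[] body(Map<String, String> params, String boundary) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : params.entrySet()) {
      sb.append("--").append(boundary).append("\r\n");
      sb.append("Content-Disposition: form-data; name=\"")
          .append(e.getKey()).append("\"\r\n");
      sb.append("Content-Type: text/plain; charset=UTF-8\r\n\r\n");
      sb.append(e.getValue()).append("\r\n");
    }
    // The closing delimiter carries two extra dashes after the boundary.
    sb.append("--").append(boundary).append("--\r\n");
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }
}
```

Because the parameters are written into the body, the request URL has the same length no matter how large the metadata is.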
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630508#comment-16630508 ]

Jan Høydahl commented on SOLR-12798:

Ok, let's keep the discussion about standard handlers. Then take the case where MCF is not going to stream a huge binary file to Solr, but rather send one Solr document with one potentially huge plain-text content field and several other metadata fields. This looks to me like a plain Solr document post to /update handler, in whatever format you'd like? If you can take advantage of Noble Paul's enhancements to stream the content this can still be a plain document not needing multipart, and no need sending data in http params?

However, if you have a use case where you both need to post some binary blob to Solr Cell and also need to pass huge metadata in literal params, then things would be different. But I have not seen such a use case yet?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630406#comment-16630406 ]

Julien Massiera commented on SOLR-12798:

[~kwri...@metacarta.com], [~janhoy], actually the provided example IS of type 2). As I mentioned, the handler used on the Solr side is a modified /update handler, not an /extract one. The name is misleading; I should have renamed it /update/no-tika. Here is its declaration in the solrconfig.xml file:

{code:java}
true ignored_ ignored_ ignored_ datafari
{code}

It does not use Tika and it understands literal.xxx parameters, so, from my point of view, there is no need to discuss this further...
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630203#comment-16630203 ]

Karl Wright commented on SOLR-12798:

[~janhoy], the example we provided is using type (1), as Julien noted. Do you want a type (2) example?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630151#comment-16630151 ]

Jan Høydahl commented on SOLR-12798:

{quote}typically for case (2) the /update handler is used, not the /update/extract handler. {quote}

But the /update handler does not support the {{literal.xxx}} parameters, so that makes no sense, does it?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630133#comment-16630133 ]

Karl Wright commented on SOLR-12798:

Hi [~janhoy], typically for case (2) the /update handler is used, not the /update/extract handler.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630027#comment-16630027 ]

Julien Massiera commented on SOLR-12798:

Hi [~janhoy],

The /extract update handler is misleading in the case of the log I shared with you, as it is not Solr's original update/extract handler but a custom one that does NOT use Tika. This is because, as you said, the document has already been parsed by the Tika of MCF.

The option 1) that you mentioned is definitely not a recommended solution in a production environment because, so far, I have experienced a lot of OOM errors when Tika has to deal with exotic files. As we use Solr as the search engine and cannot afford an interruption of service during the indexing phase, this proposal is not an option.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629971#comment-16629971 ]

Jan Høydahl commented on SOLR-12798:

If I understand correctly, you now have a choice in MCF whether to

# Stream the original binary document to Solr's extracting request handler and use Solr's built-in Tika to parse it. In this case there will NOT be a problem, since you won't have much metadata as request params, just the few you would have configured statically.
# Let MCF do the Tika conversion using the Tika Content Extractor (https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html#tikaextractor). In this case MCF will have all the various metadata parsed from the docs, which it may want to send to Solr alongside the plain-text parsed version of the document.

For 1) you don't have an issue, as you send the binary stream to the {{/extract}} endpoint.

For 2) I wonder why you use {{/extract}} at all, since Tika has already been invoked on the MCF side. This seems like an anti-pattern. The best way would be to construct a SolrInputDocument in which each {{literal.xyz}} param becomes a separate {{xyz}} field and the text body is put into a (configurable) {{content}} field, and to send everything to {{/update}} as opposed to {{/extract}}. In the case of jpg files the body text would of course be empty, as there is only metadata to be indexed.
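Jan Høydahl's second option (turn each {{literal.xyz}} param into an {{xyz}} field and put the text body into a configurable content field) can be sketched roughly as follows. This is only an illustration using plain Java collections as a stand-in for SolrJ's SolrInputDocument, so that it runs without Solr on the classpath. The class name LiteralParamsToFields, the method toDocumentFields, and the sample values are hypothetical and do not come from ManifoldCF or SolrJ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LiteralParamsToFields {
    private static final String LITERAL_PREFIX = "literal.";

    /**
     * Maps request params named literal.xyz to document fields named xyz and
     * stores the already-extracted text body under contentField.
     * Params without the literal. prefix are ignored in this sketch.
     */
    public static Map<String, List<String>> toDocumentFields(
            Map<String, List<String>> params, String bodyText, String contentField) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : params.entrySet()) {
            String name = e.getKey();
            if (name.startsWith(LITERAL_PREFIX) && name.length() > LITERAL_PREFIX.length()) {
                String fieldName = name.substring(LITERAL_PREFIX.length());
                fields.computeIfAbsent(fieldName, k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        // A metadata-only file (for example a jpg) has no body text, so no content field is added.
        if (bodyText != null && !bodyText.isEmpty()) {
            fields.computeIfAbsent(contentField, k -> new ArrayList<>()).add(bodyText);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        params.put("literal.id", List.of("doc-1"));
        params.put("literal.author", List.of("A. Writer", "B. Editor"));
        params.put("commitWithin", List.of("10000")); // not a literal.* param, so it is skipped
        System.out.println(toDocumentFields(params, "plain text body", "content"));
    }
}
```

In real SolrJ code the resulting entries would presumably be copied into a SolrInputDocument with addField and sent to {{/update}}, which moves the metadata into the request body and out of the URL.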
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629864#comment-16629864 ]

Karl Wright commented on SOLR-12798:

I've attached a patch, not meant to be applied, which shows the general approach I'd like to explore for a fix.

The biggest problem I've had in making this work is figuring out when multipart ought to be used in the HttpSolrClient code. I therefore propose that an explicit METHOD type be created for multipart post, and that HttpSolrClient pay attention to it when assembling its payload. The payload would be assembled solely using the ContentWriter mechanism, but the metadata would go into multipart form fields rather than the URL.

The patch does not contain the modifications to HttpSolrClient yet; I just wanted to initiate the discussion. Does anyone see a problem with this?
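To make "metadata in multipart form fields rather than the URL" concrete, here is a minimal sketch of how such a multipart/form-data body is laid out, written with the Java standard library only. It is not the attached patch and does not touch HttpSolrClient; the class name MultipartBodySketch, the part name "file", and the sample field names are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class MultipartBodySketch {
    private static final String CRLF = "\r\n";

    /**
     * Builds a multipart/form-data body: one text part per metadata field,
     * followed by one part carrying the document bytes.
     * Field and file names are assumed not to contain quotes or line breaks.
     */
    public static byte[] build(String boundary, Map<String, String> fields,
                               String fileName, String contentType, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            write(out, "--" + boundary + CRLF);
            write(out, "Content-Disposition: form-data; name=\"" + e.getKey() + "\"" + CRLF);
            write(out, "Content-Type: text/plain; charset=UTF-8" + CRLF + CRLF);
            write(out, e.getValue() + CRLF);
        }
        write(out, "--" + boundary + CRLF);
        write(out, "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"" + CRLF);
        write(out, "Content-Type: " + contentType + CRLF + CRLF);
        out.write(content, 0, content.length);
        // The closing delimiter has two extra dashes after the boundary.
        write(out, CRLF + "--" + boundary + "--" + CRLF);
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("literal.id", "doc-1");
        fields.put("literal.title", "A long metadata value that would not fit comfortably in a URL");
        byte[] body = build("sketch-boundary", fields, "doc.txt", "text/plain",
                "document text".getBytes(StandardCharsets.UTF_8));
        System.out.print(new String(body, StandardCharsets.UTF_8));
    }
}
```

Because every metadata value travels in its own body part, the request URL stays short no matter how much metadata a document carries.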
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629860#comment-16629860 ]

Karl Wright commented on SOLR-12798:

I should also note that other prime examples of this issue *cannot* be added to this ticket for security reasons. Most of ManifoldCF's clients are integrators; they generally cannot include company content without obtaining specific permission from the company. Luckily FranceLabs has a few examples hanging around, or it would be a real challenge to put together a real-world example for you guys.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629853#comment-16629853 ]

Karl Wright commented on SOLR-12798:

{quote}
Specifically the example that generates meaningful metadata and body (multipart) both of which are ending-up used in Solr.
{quote}

The data has now been provided, along with the Solr [INFO] log line for it. Are you still asking for the multipart request that *should* be generated by SolrJ for that request? As I've stated, we have had to modify chunks of SolrJ in order to generate that multipart request; with some work we can probably capture it in an HttpClient wire log, but it *is* some work.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629146#comment-16629146 ]

Julien Massiera commented on SOLR-12798:

Hi [~noble.paul], [~arafalov], [~kwri...@metacarta.com]

I am a ManifoldCF user/committer. Attached you will find an example of an update request that is sent to Solr after being analyzed by Tika (solr-update-request.txt), along with the corresponding original file. I also have an entity extractor that produces a lot of metadata on files, enough to exceed the URL limits.
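As a rough illustration of why large metadata cannot travel in the URL, the sketch below computes the length of the URL-encoded query string a set of params would produce and compares it with a limit. The class name QueryStringLengthCheck and the sample params are hypothetical. The 4096-byte limit is an assumption taken from the "4K" figure in the issue description; real limits depend on the server and any proxies in between. The Charset overload of URLEncoder.encode used here requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryStringLengthCheck {
    /** Assumed limit, taken from the "4K" figure in the issue description. */
    public static final int ASSUMED_LIMIT = 4096;

    /** Length of name=value pairs joined by '&' after URL encoding. */
    public static int encodedQueryLength(Map<String, String> params) {
        int length = 0;
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                length += 1; // '&' separator
            }
            // URL-encoded output is plain ASCII, so char count equals byte count.
            length += URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8).length();
            length += 1; // '='
            length += URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8).length();
            first = false;
        }
        return length;
    }

    public static boolean fitsInUrl(Map<String, String> params, int limit) {
        return encodedQueryLength(params) <= limit;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "doc-1");
        params.put("literal.entities", "x".repeat(5000)); // stand-in for extracted entity metadata
        System.out.println("encoded length: " + encodedQueryLength(params));
        System.out.println("fits in URL: " + fitsInUrl(params, ASSUMED_LIMIT));
    }
}
```

A client could use a check like this to decide whether params may stay in the URL or must move into the request body.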
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628912#comment-16628912 ]

Noble Paul commented on SOLR-12798:

bq. How many examples do you need to convince yourselves that we're not making this up?

It looks like you don't understand the objectives here. We design the SolrJ client and the server with certain use cases in mind, and in doing so we assume that we meet the needs of most or all users. The fact that you had to implement a custom client suggests that either we have failed in that, or you have failed to understand how SolrJ works. I'm sure you wouldn't open a ticket to waste our time. We have also come across many cases where users are "holding it wrong". That is why a specific example is useful. If we find that there is a genuine use case that cannot be satisfied by the state-of-the-art SolrJ client, we will work towards improving our code so that you don't have to do the dirty work.

The objective of Solr is not to support multipart form posts. It is designed to take in docs/commands and return query results. The multipart mechanism is just a means to an end. Imagine Solr working on a non-HTTP standard: in that case we would still need to support all these use cases.

So, please be patient while we try to get the details.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628903#comment-16628903 ]

Alexandre Rafalovitch commented on SOLR-12798:

Karl, we totally believe you that it is happening. We just don't have enough knowledge about your use cases to easily visualize our side of it. I think one or two simple examples would be sufficient; there is no need for an all-points bulletin.

Clearly, even though your use case was working for a long time, we somehow missed it in our tests and reasoning. So this discussion is explicitly an attempt to do better than last time.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628879#comment-16628879 ]

Karl Wright commented on SOLR-12798:

{quote}
The data may be generic, but it has to be fed into Solr in one of the accepted parameters.
{quote}

Um, this stuff has been working for more than a decade. Yes, we're using accepted parameters.

{quote}
This reason why we insist on an example is because we want to know which parameters are sent as part of query string.
{quote}

OK, if that's what you need, I will put out an all-points bulletin on the ManifoldCF user list for a Solr INFO message that contains an example of long metadata. How many examples do you need to convince yourselves that we're not making this up?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628827#comment-16628827 ]

Noble Paul commented on SOLR-12798:

bq. there's no general answer to that question, because there's no one definitive example of metadata.

The data may be generic, but it has to be fed into Solr in one of the accepted parameters. The reason we insist on an example is that we want to know which parameters are sent as part of the query string. We also want to find out whether you are using it wrong.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628813#comment-16628813 ]

Alexandre Rafalovitch commented on SOLR-12798:

[~kwri...@metacarta.com] I am with Shalin on this. While I appreciate that MCF (which we do refer people to from Solr) is a very general framework, I think it would be very useful to have a concrete example that shows what kind of information actually goes over the wire: specifically, an example that generates meaningful metadata and a body (multipart), both of which end up being used in Solr. This would really help us visualize the kind of use cases that are very obvious to your project.

The linked example was about forcing multipart, so it was not quite representative. Similarly, Tika generates one part with all parameters. An example that has 2 (3?) meaningful parts would be most helpful, I feel. Ideally it would be something that could go into a Solr test, so it does not need to be very long, just truly multipart.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628630#comment-16628630 ]

Karl Wright commented on SOLR-12798:

[~shalinmangar], there's no general answer to that question, because there's no one definitive example of metadata. I refer you to the project page for ManifoldCF here: https://manifoldcf.apache.org/en_US/index.html#What+Is+Apache+ManifoldCF%3F
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628586#comment-16628586 ]

Shalin Shekhar Mangar commented on SOLR-12798:

[~kwri...@metacarta.com] - One thing that wasn't clear to me from the issue description and comments: what is the metadata for, and why is it supposed to go through the request URL? I'd appreciate it if you could give an example of the metadata for my understanding. Thanks!
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628354#comment-16628354 ]

Karl Wright commented on SOLR-12798:

Ok, thanks for the clarification. I will propose SolrJ changes to allow multipart form transport as a first-class citizen, using the ContentWriter construct, and attach those as a patch to this ticket. The other fixes I will propose separately. Or, if you want to tackle this, I'd be happy to hand it to you. Please let me know.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628338#comment-16628338 ]

Noble Paul commented on SOLR-12798:

{quote}So there's a fix for multipart post usage? Is this committed to master? How do you turn it on, or does it do this automatically?{quote}

I never bothered with multipart post. I wanted to ensure that we don't write the docs to memory before we post to the server. That's the fix. As long as you can generate docs in a streaming fashion, there is no limit to the number of docs that the client can write in a single request.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628307#comment-16628307 ]

Karl Wright commented on SOLR-12798:

{quote}this no longer is the case{quote}

That's good news; I can change things in ManifoldCF accordingly, since in that case we no longer have to enforce a maximum document size limit.

{quote}I have fixed this problem in the current SolrJ{quote}

So there's a fix for multipart post usage? Is this committed to master? How do you turn it on, or does it do this automatically?

Once that's there, it would be straightforward to add my other fixes; I'm a Lucene/Solr committer now as well, so I can ticket and propose them and they will get done this time.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628265#comment-16628265 ]

Noble Paul commented on SOLR-12798:

bq. there are two problems with using UpdateRequest. First, as you point out, the entire document has to hit memory.

This is no longer the case. The reason I changed the interface is to ensure that we don't write everything to memory. You can provide a request that creates documents on the fly, and the memory consumption is trivial.

bq. Yes, of course it can, but the way SolrJ is constructed it makes no use of this. In fact, it currently doesn't use multipart post at all, unless I override much functionality in order to force it to do so.

I have fixed this problem in the current SolrJ.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627826#comment-16627826 ]

Karl Wright commented on SOLR-12798:

[~dsmiley], although it doesn't seem to have been appreciated, SolrJ did have reasonable support for multipart post a few major versions ago, but I appreciate that this is no longer a priority. I'm happy to help get this back to the point that MCF needs.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627718#comment-16627718 ]

David Smiley commented on SOLR-12798:

Okay, I appreciate your points. It'd be nice if SolrJ could be enhanced to support multipart to avoid long URL construction. Please help make this a first-class supported feature if it matters to you/ManifoldCF.

I don't think this is a bug though, and thus not a regression. Before 7.5, by your account, you really had to go out of your way to make multipart work with SolrJ. The internals changed, which thwarted your efforts (a shame), but that doesn't represent a bug. I appreciate it's a frustrating, unexpected turn of events nonetheless.
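For readers unfamiliar with what "multipart to avoid long URL construction" means on the wire, here is a minimal sketch of a multipart/form-data body in which each metadata value travels as its own form part and the document content is streamed as a file part. It is not SolrJ code and not the patch attached to this ticket; the class name, the `write`/`render` helpers, and the field names used below are all hypothetical, and part names are written without any escaping.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

final class MultipartSketch {
    private static final String CRLF = "\r\n";

    /** Writes metadata as form parts, then streams the content as a file part. */
    static void write(OutputStream out, String boundary,
                      Map<String, List<String>> fields,
                      String fileField, String fileName,
                      InputStream content) throws IOException {
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            for (String value : field.getValue()) {
                text(out, "--" + boundary + CRLF);
                text(out, "Content-Disposition: form-data; name=\""
                        + field.getKey() + "\"" + CRLF + CRLF);
                text(out, value);
                text(out, CRLF);
            }
        }
        text(out, "--" + boundary + CRLF);
        text(out, "Content-Disposition: form-data; name=\"" + fileField
                + "\"; filename=\"" + fileName + "\"" + CRLF);
        text(out, "Content-Type: application/octet-stream" + CRLF + CRLF);
        content.transferTo(out); // copies in chunks; the content is never fully buffered here
        text(out, CRLF + "--" + boundary + "--" + CRLF);
    }

    /** Convenience for small inputs only: renders the whole body as a string. */
    static String render(String boundary, Map<String, List<String>> fields,
                         String fileField, String fileName, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            write(out, boundary, fields, fileField, fileName,
                  new ByteArrayInputStream(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString(StandardCharsets.UTF_8);
    }

    private static void text(OutputStream out, String s) throws IOException {
        out.write(s.getBytes(StandardCharsets.UTF_8));
    }
}
```

The point of the layout is that the metadata size is bounded only by the request body, not by the URL, while the content part can still be streamed from its source.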
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627698#comment-16627698 ]

Karl Wright commented on SOLR-12798:

[~dsmiley], there are two problems with using UpdateRequest. First, as you point out, the entire document has to hit memory. This is problematic because sometimes these documents are massive and nevertheless Tika needs all of them to extract stuff from them. So we allow two modes of operation:

(1) Via Solr Cell, in which case we use ContentStreamUpdateRequest, which embeds a stream and forms the request without having the entire document hit memory, and
(2) Via UpdateRequest and SolrInputDocument, but only after Tika has been invoked, and with a length limit.

Even then we have problems with people running out of memory unless they are very careful, given that there are sometimes dozens of indexing requests active at any one time.

This information, by the way, has nothing to do with length limits on the URL, since those are determined solely by metadata, which can be large and is independent of the main content stream. URL limits get in the way just as readily when we use mode (2) as when we use mode (1).
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627674#comment-16627674 ]

David Smiley commented on SOLR-12798:

[~arafalov] I believe this issue is about SolrJ (client-side), not the Solr server.

[~kwri...@metacarta.com] why must ManifoldCF rely on HTTP Multipart in particular – can't it compose a SolrInputDocument and just send it like basically all Solr clients I've ever seen? Is the issue about the "infinite length" content stream, which I presume maps to some sort of body text? Note the existence of {{UpdateRequest.setDocIterator(Iterator)}} which can be helpful in streaming and materializing documents on the fly.
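To illustrate what "materializing documents on the fly" could look like, here is a minimal sketch of an iterator that builds each document only when it is asked for. It is deliberately free of SolrJ so that it stands alone: a `Map<String, Object>` is a stand-in for `SolrInputDocument`, and the class name and field names are hypothetical. With SolrJ on the classpath, an `Iterator<SolrInputDocument>` of the same shape is presumably what would be handed to the {{UpdateRequest.setDocIterator(Iterator)}} method named above.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Produces documents one at a time instead of holding a full batch in memory. */
final class LazyDocs implements Iterator<Map<String, Object>> {
    private final int total;
    private int produced = 0;

    LazyDocs(int total) {
        this.total = total;
    }

    @Override
    public boolean hasNext() {
        return produced < total;
    }

    @Override
    public Map<String, Object> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // The document is created here, at the moment the consumer pulls it.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "doc-" + produced);
        doc.put("body", "text of document " + produced);
        produced++;
        return doc;
    }

    /** Number of documents materialized so far. */
    int producedSoFar() {
        return produced;
    }
}
```

Note that this only bounds how many documents exist at once; it does not by itself address a single document whose content stream or metadata is very large.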
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627589#comment-16627589 ]

Alexandre Rafalovitch commented on SOLR-12798:

As far as I understand, we do advertise multipart upload in several places:
 * [https://lucene.apache.org/solr/guide/7_5/content-streams.html#content-stream-sources]
 * multipartUploadLimitInKB parameter in solrconfig.xml [https://lucene.apache.org/solr/guide/7_5/requestdispatcher-in-solrconfig.html#requestparsers-element]

If the current issue changes those expectations without explicitly discussing them and including the new user-visible limitation in the migration guide for 7.5, we have a clear case of critical regression on our hands.

I cannot see any tests in the codebase for this, but - if my assessment is correct - perhaps one should exist.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627547#comment-16627547 ]

Karl Wright commented on SOLR-12798:

[~noble.paul] 'We are assuming your usecase can only be implemented using a multipart request. Can we see what you send in the request parameters?'

That's kind of a silly question, if you don't mind me saying so. MCF is a framework with dozens of connectors for accessing different kinds of document repositories. A "document" in ManifoldCF consists of:
- A content stream of infinite length
- Unlimited metadata, in the form of name/valuelist pairs

Documents that have large amounts of metadata are common. The details vary considerably by source repository. To give just one example, we have one client who seemingly specializes in indexing image content. The images are run through Tika, which takes these images and produces a zero-length text file and sometimes 100K bytes of metadata text, in multiple metadata fields.

I hope that's enough to demonstrate why it is impossible to expect all the metadata for a document to fit in the URL.
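The arithmetic behind that last point can be sketched as follows. The snippet measures how long name/valuelist metadata becomes once it is encoded as query parameters and compares it against a URL budget. The 4096-byte budget is an assumption taken from the "4K" figure in the issue description, not a limit defined by Solr or SolrJ, and the class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

final class MetadataUrlBudget {
    // Assumed budget, based on the 4K figure quoted in the issue description.
    static final int ASSUMED_URL_LIMIT_BYTES = 4096;

    /** Length of the metadata encoded as name=value pairs joined with '&'. */
    static int encodedLength(Map<String, List<String>> metadata) {
        int length = 0;
        for (Map.Entry<String, List<String>> field : metadata.entrySet()) {
            String name = URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8);
            for (String value : field.getValue()) {
                if (length > 0) {
                    length++; // '&' separator between pairs
                }
                String encoded = URLEncoder.encode(value, StandardCharsets.UTF_8);
                length += name.length() + 1 + encoded.length(); // name '=' value
            }
        }
        return length;
    }

    /** True when the encoded metadata stays within the assumed budget. */
    static boolean fitsInUrl(Map<String, List<String>> metadata) {
        return encodedLength(metadata) <= ASSUMED_URL_LIMIT_BYTES;
    }
}
```

A single metadata field carrying 100K bytes of text, as in the image example above, comes out more than twenty times over that budget before any other parameter is added.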
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627408#comment-16627408 ]

Noble Paul commented on SOLR-12798:

ContentWriter is mostly implemented by anonymous classes. StringPayloadWriter is just a helper class.

I'm still thinking this is an XY problem. We are assuming your usecase can only be implemented using a multipart request. Can we see what you send in the request parameters?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627138#comment-16627138 ]

Karl Wright commented on SOLR-12798:

It looks like the only implementer of ContentWriter is StringPayloadContentWriter, which just furnishes a string for output, correct? In order to work within that framework, ContentStreamUpdateRequest would need a streaming ContentWriter implementation that pulls from the input and writes to the output. That seems to be missing.

And then this has nothing whatsoever to do with how the content is actually transmitted -- it seems that the assumption is that the new ContentWriter stuff all goes via PUT with metadata in the URL. That's not good for two reasons: first, the URL length problems I've already mentioned, and second -- Solr Cell uses the "name" part of the multipart post to inject its own bit of metadata into the document, and there would be no way to transmit that anymore.

Logic is therefore still going to be needed to use multipart forms under specific circumstances. Maybe there needs to be a useMultipart() method in all Requests, and HttpSolrClient should look at that to decide whether to use multipart or standard PUT?
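As a rough illustration of the streaming writer described in this comment, here is a minimal sketch. It assumes a writer only has to expose a `write(OutputStream)` method and a content type; the `ContentWriter` interface below is a local stand-in declared so the example compiles without SolrJ, and `StreamingContentWriter` and `StreamingContentWriterDemo` are hypothetical names, not existing SolrJ classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Local stand-in for a content-writer contract (assumption: write + content type).
interface ContentWriter {
  void write(OutputStream os) throws IOException;
  String getContentType();
}

// Hypothetical streaming writer: copies the source to the output in chunks,
// so the whole document never has to be held in memory as a String.
final class StreamingContentWriter implements ContentWriter {
  private final InputStream source;
  private final String contentType;

  StreamingContentWriter(InputStream source, String contentType) {
    this.source = source;
    this.contentType = contentType;
  }

  @Override
  public void write(OutputStream os) throws IOException {
    byte[] buffer = new byte[8192];
    int read;
    while ((read = source.read(buffer)) != -1) {
      os.write(buffer, 0, read);
    }
    os.flush();
  }

  @Override
  public String getContentType() {
    return contentType;
  }
}

class StreamingContentWriterDemo {
  // Pushes the given bytes through the writer and returns what came out.
  static byte[] copy(byte[] input) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
      new StreamingContentWriter(new ByteArrayInputStream(input), "application/octet-stream").write(out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] result = copy("hello solr".getBytes(StandardCharsets.UTF_8));
    System.out.println(new String(result, StandardCharsets.UTF_8));
  }
}
```

Note that a writer of this shape addresses only the body; it does not by itself solve the second concern raised above, namely carrying per-part metadata such as the multipart "name".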
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627098#comment-16627098 ]

Karl Wright commented on SOLR-12798:

Hi [~noble.paul], as I explained before, we have document metadata in excess of the maximum URL length quite often. In fact, it's the typical case. That is why we must use multipart post in this application. My rough estimate of the percentage of ManifoldCF users who fall into this category is greater than 90%.
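To make the size argument concrete, the following sketch estimates how long a query string becomes once metadata is URL-encoded into it. The 4096-character limit is an assumption taken from the "4K" figure in the issue description; real limits depend on the servlet container and any proxies in between. `QueryLengthCheck` and its methods are hypothetical helpers, and the base URL length is ignored for simplicity.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class QueryLengthCheck {
  // Assumed limit, based on the "4K" figure quoted in the issue description.
  static final int ASSUMED_URL_LIMIT = 4096;

  private static String encode(String s) {
    try {
      return URLEncoder.encode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }

  // Length of "?k1=v1&k2=v2..." after URL-encoding keys and values.
  static int encodedQueryLength(Map<String, String> params) {
    int length = 0;
    for (Map.Entry<String, String> e : params.entrySet()) {
      length += 1; // leading '?' or '&'
      length += encode(e.getKey()).length() + 1 + encode(e.getValue()).length();
    }
    return length;
  }

  static boolean fitsInUrl(Map<String, String> params) {
    return encodedQueryLength(params) <= ASSUMED_URL_LIMIT;
  }

  public static void main(String[] args) {
    Map<String, String> params = new LinkedHashMap<>();
    params.put("title", "abc");
    System.out.println(encodedQueryLength(params) + " fits=" + fitsInUrl(params));
  }
}
```

Encoding can inflate the length considerably, since each reserved or non-ASCII character expands to one or more `%XX` triplets, so metadata well under the limit in raw form may still overflow it.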
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627038#comment-16627038 ]

Noble Paul commented on SOLR-12798:
---
Pardon me, I'm still wondering what the real reason is that you must use a multipart request. What is stopping you from using a standard update request for all these operations instead of a multipart request?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627028#comment-16627028 ]

Karl Wright commented on SOLR-12798:

[~noble.paul] We have a custom implementation because SolrJ and indeed HttpComponents/HttpClient have problems we're forced to work around. These have been raised before but so far apparently not taken too seriously. The need to work around things has gotten even more significant with the latest release.

ModifiedHttpSolrClient is a derivation of HttpSolrClient. The method overridden, createMethod(), is a direct copy of HttpSolrClient.createMethod() with certain very specific changes. These are apparently all still necessary. I've included the method code below.

If I disable this custom method, and use standard code, I *never* get multipart form posts at all. That is unacceptable in this application. With the current modifications included below, I get multipart posts for everything, including for deletions, which breaks because Solr doesn't like that. I'm asking for advice as to how to get multipart posts only for documents, either ones transmitted by ContentStreamUpdateRequest or UpdateRequest.add(SolrInputDocument).

{code}
  @Override
  protected HttpRequestBase createMethod(SolrRequest request, String collection) throws IOException, SolrServerException {
    if (request instanceof V2RequestSupport) {
      request = ((V2RequestSupport) request).getV2Request();
    }
    SolrParams params = request.getParams();
    RequestWriter.ContentWriter contentWriter = requestWriter.getContentWriter(request);
    Collection<ContentStream> streams = contentWriter == null ? requestWriter.getContentStreams(request) : null;
    String path = requestWriter.getPath(request);
    if (path == null || !path.startsWith("/")) {
      path = DEFAULT_PATH;
    }

    ResponseParser parser = request.getResponseParser();
    if (parser == null) {
      parser = this.parser;
    }

    // The parser 'wt=' and 'version=' params are used instead of the original
    // params
    ModifiableSolrParams wparams = new ModifiableSolrParams(params);
    if (parser != null) {
      wparams.set(CommonParams.WT, parser.getWriterType());
      wparams.set(CommonParams.VERSION, parser.getVersion());
    }
    if (invariantParams != null) {
      wparams.add(invariantParams);
    }

    String basePath = baseUrl;
    if (collection != null)
      basePath += "/" + collection;

    if (request instanceof V2Request) {
      if (System.getProperty("solr.v2RealPath") == null) {
        basePath = baseUrl.replace("/solr", "/api");
      } else {
        basePath = baseUrl + "/v2";
      }
    }

    if (SolrRequest.METHOD.GET == request.getMethod()) {
      if (streams != null || contentWriter != null) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "GET can't send streams!");
      }
      return new HttpGet(basePath + path + toQueryString(wparams, false));
    }

    if (SolrRequest.METHOD.DELETE == request.getMethod()) {
      return new HttpDelete(basePath + path + toQueryString(wparams, false));
    }

    if (SolrRequest.METHOD.POST == request.getMethod() || SolrRequest.METHOD.PUT == request.getMethod()) {

      // UpdateRequest uses PUT now, and ContentStreamUpdateRequest uses POST.
      // We must override PUT with POST if multipart is on.
      // If useMultipart is on, we fall back to getting streams directly from the request.
      final boolean mustUseMultipart;
      final SolrRequest.METHOD methodToUse;
      if (this.useMultiPartPost) {
        final Collection<ContentStream> requestStreams = request.getContentStreams();
        mustUseMultipart = requestStreams != null && requestStreams.size() > 0;
        if (mustUseMultipart) {
          System.out.println("Overriding with multipart post");
          streams = requestStreams;
          methodToUse = SolrRequest.METHOD.POST;
        } else {
          methodToUse = request.getMethod();
        }
      } else {
        mustUseMultipart = false;
        methodToUse = request.getMethod();
      }

      //System.out.println("Post or put");
      String url = basePath + path;
      /*
      boolean hasNullStreamName = false;
      if (streams != null) {
        for (ContentStream cs : streams) {
          if (cs.getName() == null) {
            hasNullStreamName = true;
            break;
          }
        }
      }
      */

      /*
      final boolean isMultipart = ((this.useMultiPartPost && SolrRequest.METHOD.POST == methodToUse)
          || (streams != null && streams.size() > 1)) && !hasNullStreamName;
      */
      final boolean isMultipart = this.useMultiPartPost && SolrRequest.METHOD.POST == methodToUse &&
          (streams != null && streams.size() >= 1);
      System.out.println("isMultipart = "+isMultipart);

      LinkedList<NameValuePair> postOrPutParams = new LinkedList<>();
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627010#comment-16627010 ]

Karl Wright commented on SOLR-12798:

I'm looking for workarounds -- initially, at least.

What I've tried is adding the following code in the POST/PUT section of the HttpSolrClient code:

{code}
      // UpdateRequest uses PUT now, and ContentStreamUpdateRequest uses POST.
      // We must override PUT with POST if multipart is on.
      // If useMultipart is on, we fall back to getting streams directly from the request.
      final boolean mustUseMultipart;
      final SolrRequest.METHOD methodToUse;
      if (this.useMultiPartPost) {
        final Collection<ContentStream> requestStreams = request.getContentStreams();
        mustUseMultipart = requestStreams != null && requestStreams.size() > 0;
        if (mustUseMultipart) {
          System.out.println("Overriding with multipart post");
          streams = requestStreams;
          methodToUse = SolrRequest.METHOD.POST;
        } else {
          methodToUse = request.getMethod();
        }
      } else {
        mustUseMultipart = false;
        methodToUse = request.getMethod();
      }

      //System.out.println("Post or put");
      String url = basePath + path;
      /*
      boolean hasNullStreamName = false;
      if (streams != null) {
        for (ContentStream cs : streams) {
          if (cs.getName() == null) {
            hasNullStreamName = true;
            break;
          }
        }
      }
      */

      /*
      final boolean isMultipart = ((this.useMultiPartPost && SolrRequest.METHOD.POST == methodToUse)
          || (streams != null && streams.size() > 1)) && !hasNullStreamName;
      */
      final boolean isMultipart = this.useMultiPartPost && SolrRequest.METHOD.POST == methodToUse &&
          (streams != null && streams.size() >= 1);
      System.out.println("isMultipart = "+isMultipart);
{code}

The problem is that when multipart post is used for document delete requests, they fail because the stream is empty. And the code above doesn't distinguish between UpdateRequests that include real documents and UpdateRequests that are delete requests.

Any ideas?
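One way to frame the missing distinction is to base the multipart decision on what the request carries rather than on its HTTP method. The sketch below assumes the request can report whether it contains documents to add. `UpdateSketch` and `MultipartDecision` are hypothetical stand-ins written for this illustration, not SolrJ classes, and documents are modeled as plain strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for an update request; not the SolrJ class.
final class UpdateSketch {
  final List<String> documents; // documents to add
  final List<String> deleteIds; // ids to delete

  UpdateSketch(List<String> documents, List<String> deleteIds) {
    this.documents = documents;
    this.deleteIds = deleteIds;
  }
}

final class MultipartDecision {
  // Multipart is chosen only when the client-level flag is on AND the request
  // actually adds documents; delete-only requests keep their normal path.
  static boolean shouldUseMultipart(boolean useMultiPartPost, UpdateSketch request) {
    return useMultiPartPost && !request.documents.isEmpty();
  }

  public static void main(String[] args) {
    UpdateSketch add = new UpdateSketch(Arrays.asList("doc1"), Collections.<String>emptyList());
    UpdateSketch delete = new UpdateSketch(Collections.<String>emptyList(), Arrays.asList("doc1"));
    System.out.println("add -> " + shouldUseMultipart(true, add));
    System.out.println("delete -> " + shouldUseMultipart(true, delete));
  }
}
```

Whether a real request object exposes enough information to make this call inside createMethod() is exactly the open question in the comment above; the sketch only shows the shape of the check.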
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626970#comment-16626970 ]

Noble Paul commented on SOLR-12798:
---
You have a custom implementation of HttpSolrClient. Is it not possible for you to start using the {{public ContentWriter getContentWriter(SolrRequest req)}} method? The {{getContentStreams}} method is not the appropriate method to use.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626903#comment-16626903 ]

Karl Wright commented on SOLR-12798:

Thinking about this: I think the fix might well be to add a decent implementation of getContentStreams() to BinaryRequestWriter, and then to prioritize the use of content streams in HttpSolrClient when useMultipart is true. That would fix the basic problem, if it doesn't introduce other ones.

For the ManifoldCF project's immediate release concerns, I'd have to create a ModifiedUpdateRequest class and a ModifiedBinaryRequestWriter class, if they're not locked down anyway, and use ModifiedUpdateRequest instead of UpdateRequest whenever I need to add SolrInputDocuments. I'll check out whether this would work. That makes some six SolrJ classes that ManifoldCF needs to override, however, just to get multipart post to work properly. I think it's time to make multipart post a first-class citizen for SolrJ, no?
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626079#comment-16626079 ]

Karl Wright commented on SOLR-12798:

[~noble.paul], I can verify that the problem still exists on Solr 7.5.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625869#comment-16625869 ]

Noble Paul commented on SOLR-12798:
---
Can you test with the latest SolrJ release {{7.5}} and confirm whether this problem still exists? I shall take a look after that.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625809#comment-16625809 ]

Karl Wright commented on SOLR-12798:

The status is as follows:

(1) I've confirmed that the RequestWriter override only permits multipart form requests for the "commit" request. Neither "Update" nor "Delete" allows this pathway at all.
(2) If I change the logic for all POST and PUT requests to disable the contentWriter clause, POST requests of documents work properly, but delete document requests fail.
(3) Conditionally disabling contentWriter when the request is of class ContentStreamUpdateRequest allows things to work partly. Text documents that are indexed via standard UpdateRequest do not use multipart post, however.

So we need a better solution.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625558#comment-16625558 ]

Karl Wright commented on SOLR-12798:

We're researching the actual issue that is blocking release. It seems that deleting documents using a Solr Cloud installation may not be working; for each document, we're seeing a 400 error with the following message:

Error from server at http://localhost:8983/solr/FileShare_shard1_replica_n1: missing content stream: Error from server at http://localhost:8983/solr/FileShare_shard1_replica_n1: missing content stream

Furthermore, after checking the Solr index, none of the documents have been removed. This is obviously severe and we're trying now to confirm that this happens without our modifications to HttpSolrClient.
[jira] [Commented] (SOLR-12798) Structural changes in SolrJ since version 7.0.0 have effectively disabled multipart post
[ https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625544#comment-16625544 ]

Karl Wright commented on SOLR-12798:

[~noble.paul], any help would be welcome. We're in a ManifoldCF release cycle now, and SolrJ issues are blocking it.