Hi Karl,

Could you try testing MFC with Solr? I cannot see the content field with either Windows Shares or File System on Solr 4.x, 5.x, or 6.x. Only Solr 4.x has content, and it is as I described. The part of the code that sends content as a stream may have a problem.
Kind Regards,
Furkan KAMACI

On Fri, Nov 25, 2016 at 4:13 PM, Furkan KAMACI <[email protected]> wrote:

> Hi Karl,
>
> By the way, I've tried different versions of Solr and either couldn't get
> content or got it as I've explained. When I check out the MFC trunk, which
> uses Solr 6.3.0, and use Solr 6.3.0 as the output connector, I can see that
> documents are indexed, but I cannot even see the "content" field.
>
> Kind Regards,
> Furkan KAMACI
>
> On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Furkan,
>>
>> The following code is used to set up a SolrJ object that is then later
>> converted to a post request:
>>
>> >>>>>>
>>   private void buildExtractUpdateHandlerRequest( long length, InputStream is, String contentType,
>>     String contentName,
>>     ContentStreamUpdateRequest contentStreamUpdateRequest )
>>     throws IOException
>>   {
>>     ModifiableSolrParams out = new ModifiableSolrParams();
>>
>>     // Write the id field
>>     writeField(out,LITERAL+idAttributeName,documentURI);
>>     // Write the rest of the attributes
>>     if (originalSizeAttributeName != null)
>>     {
>>       Long size = document.getOriginalSize();
>>       if (size != null)
>>         // Write value
>>         writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>>     }
>>     if (modifiedDateAttributeName != null)
>>     {
>>       Date date = document.getModifiedDate();
>>       if (date != null)
>>         // Write value
>>         writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>>     }
>>     if (createdDateAttributeName != null)
>>     {
>>       Date date = document.getCreatedDate();
>>       if (date != null)
>>         // Write value
>>         writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>>     }
>>     if (indexedDateAttributeName != null)
>>     {
>>       Date date = document.getIndexingDate();
>>       if (date != null)
>>         // Write value
>>         writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>>     }
>>     if (fileNameAttributeName != null)
>>     {
>>       String fileName = document.getFileName();
>>       if (!StringUtils.isBlank(fileName))
>>         writeField(out,LITERAL+fileNameAttributeName,fileName);
>>     }
>>     if (mimeTypeAttributeName != null)
>>     {
>>       String mimeType = document.getMimeType();
>>       if (!StringUtils.isBlank(mimeType))
>>         writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>>     }
>>
>>     // Write the access token information
>>     // Both maps have the same keys.
>>     Iterator<String> typeIterator = aclsMap.keySet().iterator();
>>     while (typeIterator.hasNext())
>>     {
>>       String aclType = typeIterator.next();
>>       writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>>     }
>>
>>     // Write the arguments
>>     for (String name : arguments.keySet())
>>     {
>>       List<String> values = arguments.get(name);
>>       writeField(out,name,values);
>>     }
>>
>>     // Write the metadata, each in a field by itself
>>     buildSolrParamsFromMetadata(out);
>>
>>     // These are unnecessary now in the case of non-solrcloud setups, because we overrode the SolrJ posting method to use multipart.
>>     //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>>     //writeField(out,LITERAL+"stream_name",document.getFileName());
>>
>>     // General hint for Tika
>>     if (!StringUtils.isBlank(document.getFileName()))
>>       writeField(out,"resource.name",document.getFileName());
>>
>>     // Write the commitWithin parameter
>>     if (commitWithin != null)
>>       writeField(out,COMMITWITHIN_METADATA,commitWithin);
>>
>>     contentStreamUpdateRequest.setParams(out);
>>
>>     contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length,contentType,contentName));
>>   }
>> <<<<<<
>>
>> The ContentStreamUpdateRequest object is defined within SolrJ. Normally
>> this would be the end of ManifoldCF involvement, but we have also needed to
>> override some SolrJ classes because of bugs. So it is possible that we
>> could fix this behavior if the problem is within the code we have changed.
>> However, having said that, I am not sure that the differences you report
>> are significant in any way. The w3c spec for multipart HTTP requests is
>> what you'd want to look at for that.
>>
>> Please see ModifiedHttpMultipart.java for more details.
>>
>> Thanks,
>> Karl
>>
>> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI <[email protected]> wrote:
>>
>>> Hi Karl,
>>>
>>> I used default values for Solr. In my Solr output connector, "Use the
>>> Extract Update Handler" is checked. The update handler is defined as
>>> "/update/extract". There is no Tika content extractor defined in the job
>>> pipeline.
>>>
>>> I have WireShark captures and logs from both ManifoldCF and Solr. I can
>>> share them if you want.
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> Is this being indexed via the extracting update handler? What does
>>>> your pipeline look like? Is the tika extractor in the pipeline?
>>>>
>>>> Karl
>>>>
>>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <[email protected]> wrote:
>>>>
>>>>> I've indexed a file via ManifoldCF to Solr whose content starts
>>>>> with:
>>>>>
>>>>> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
>>>>> directed by Elia Kazan, 1951*
>>>>>
>>>>> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>>> by Elia Kazan, 1951*
>>>>>
>>>>> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>>> by Elia Kazan, 1951*
>>>>>
>>>>> However, when I check Solr, I see this in the content field:
>>>>>
>>>>> * " \n \nstream_source_info MARLON BRANDO.rtf \nstream_content_type application/rtf \nstream_size 13580 \nstream_name MARLON BRANDO.rtf \nContent-Type application/rtf \nresourceName MARLON BRANDO.rtf \n \n \n 1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\" directed by Elia Kazan \n"*
>>>>>
>>>>> There are two problems here.
>>>>>
>>>>> 1) There are unnecessary newline characters.
>>>>>
>>>>> 2) Metadata is prepended to the content field, which should not be
>>>>> there.
>>>>>
>>>>> So the problem may be in Solr or in ManifoldCF (related to Tika). When
>>>>> I index the same document to Solr via cURL, no newline characters or
>>>>> metadata are prepended.
>>>>>
>>>>> What do you think would be a solution?
>>>>>
>>>>> Kind Regards,
>>>>> Furkan KAMACI
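
For anyone comparing the ManifoldCF request with the cURL request, it may help to see which parameters the quoted method ends up sending to "/update/extract". The following is only a minimal sketch using the Java standard library, not ManifoldCF or SolrJ code. The class name ExtractParamsSketch, its two methods, and the sample values are hypothetical. It also assumes that LITERAL in the quoted code is Solr's "literal." field prefix and that COMMITWITHIN_METADATA resolves to "commitWithin"; neither is confirmed by the thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractParamsSketch {
  // Assumption: mirrors the LITERAL constant referenced in the quoted method.
  static final String LITERAL = "literal.";

  // Collects a small subset of the parameters the quoted method writes:
  // the id literal, the Tika file-name hint, and the optional commitWithin.
  static Map<String, String> buildParams(String idField, String documentUri,
                                         String fileName, String commitWithin) {
    Map<String, String> out = new LinkedHashMap<>();
    out.put(LITERAL + idField, documentUri);
    if (fileName != null && !fileName.trim().isEmpty()) {
      out.put("resource.name", fileName);
    }
    if (commitWithin != null) {
      // Assumption: COMMITWITHIN_METADATA is the "commitWithin" parameter.
      out.put("commitWithin", commitWithin);
    }
    return out;
  }

  // Renders the parameters as a URL-encoded query string, in insertion order.
  static String toQueryString(Map<String, String> params) {
    StringBuilder sb = new StringBuilder();
    try {
      for (Map.Entry<String, String> e : params.entrySet()) {
        if (sb.length() > 0) {
          sb.append('&');
        }
        sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
          .append('=')
          .append(URLEncoder.encode(e.getValue(), "UTF-8"));
      }
    } catch (UnsupportedEncodingException ex) {
      // UTF-8 is always available on a conforming JVM.
      throw new IllegalStateException(ex);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> params =
        buildParams("id", "file:/docs/MARLON BRANDO.rtf", "MARLON BRANDO.rtf", "10000");
    System.out.println("/update/extract?" + toQueryString(params));
  }
}
```

Printing the parameters this way makes it easier to line them up against the query string of a cURL request or a WireShark capture. The stream_* values reported in the content field are not part of this sketch, since the quoted method no longer writes them as parameters.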
