Hi Karl,

By the way, I've tried different versions of Solr and either couldn't get content at all or got it in the form I've explained. When I check out the MFC trunk, which uses Solr 6.3.0, and use Solr 6.3.0 as the output connector, I can see that documents are indexed but I cannot even see the "content" field.
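[Editor's note: a minimal sketch, not taken from the connector itself, of the parameter convention in the code Karl quotes below. `writeField` is simplified to a plain multimap rather than SolrJ's `ModifiableSolrParams`, and all field names and values here are hypothetical. The point it illustrates: document attributes become `literal.`-prefixed parameters for the extracting update handler, while Tika hints such as `resource.name` are passed unprefixed.]

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExtractParamsSketch {
    // Mirrors the LITERAL constant used in the quoted connector code.
    static final String LITERAL = "literal.";

    // Simplified stand-in for writeField(): the real one fills a
    // SolrJ ModifiableSolrParams; a plain multimap is enough here.
    static void writeField(Map<String, List<String>> out, String name, String value) {
        out.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Builds the kind of parameter map that ends up in the multipart
    // POST to /update/extract. Names and values are illustrative only.
    static Map<String, List<String>> buildParams(String documentURI, String fileName) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        writeField(out, LITERAL + "id", documentURI);        // the id field
        writeField(out, LITERAL + "stream_name", fileName);  // a metadata attribute
        writeField(out, "resource.name", fileName);          // general hint for Tika
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> params =
            buildParams("file:/docs/example.rtf", "example.rtf");
        System.out.println(params);
    }
}
```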
Kind Regards,
Furkan KAMACI

On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <[email protected]> wrote:

> Hi Furkan,
>
> The following code is used to set up a SolrJ object that is then later
> converted to a post request:
>
> >>>>>>
>   private void buildExtractUpdateHandlerRequest(long length,
>     InputStream is, String contentType,
>     String contentName,
>     ContentStreamUpdateRequest contentStreamUpdateRequest)
>     throws IOException
>   {
>     ModifiableSolrParams out = new ModifiableSolrParams();
>
>     // Write the id field
>     writeField(out,LITERAL+idAttributeName,documentURI);
>     // Write the rest of the attributes
>     if (originalSizeAttributeName != null)
>     {
>       Long size = document.getOriginalSize();
>       if (size != null)
>         // Write value
>         writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>     }
>     if (modifiedDateAttributeName != null)
>     {
>       Date date = document.getModifiedDate();
>       if (date != null)
>         // Write value
>         writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>     }
>     if (createdDateAttributeName != null)
>     {
>       Date date = document.getCreatedDate();
>       if (date != null)
>         // Write value
>         writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>     }
>     if (indexedDateAttributeName != null)
>     {
>       Date date = document.getIndexingDate();
>       if (date != null)
>         // Write value
>         writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>     }
>     if (fileNameAttributeName != null)
>     {
>       String fileName = document.getFileName();
>       if (!StringUtils.isBlank(fileName))
>         writeField(out,LITERAL+fileNameAttributeName,fileName);
>     }
>     if (mimeTypeAttributeName != null)
>     {
>       String mimeType = document.getMimeType();
>       if (!StringUtils.isBlank(mimeType))
>         writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>     }
>
>     // Write the access token information
>     // Both maps have the same keys.
>     Iterator<String> typeIterator = aclsMap.keySet().iterator();
>     while (typeIterator.hasNext())
>     {
>       String aclType = typeIterator.next();
>       writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>     }
>
>     // Write the arguments
>     for (String name : arguments.keySet())
>     {
>       List<String> values = arguments.get(name);
>       writeField(out,name,values);
>     }
>
>     // Write the metadata, each in a field by itself
>     buildSolrParamsFromMetadata(out);
>
>     // These are unnecessary now in the case of non-solrcloud setups,
>     // because we overrode the SolrJ posting method to use multipart.
>     //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>     //writeField(out,LITERAL+"stream_name",document.getFileName());
>
>     // General hint for Tika
>     if (!StringUtils.isBlank(document.getFileName()))
>       writeField(out,"resource.name",document.getFileName());
>
>     // Write the commitWithin parameter
>     if (commitWithin != null)
>       writeField(out,COMMITWITHIN_METADATA,commitWithin);
>
>     contentStreamUpdateRequest.setParams(out);
>
>     contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length,contentType,contentName));
>   }
> <<<<<<
>
> The ContentStreamUpdateRequest object is defined within SolrJ. Normally
> this would be the end of ManifoldCF involvement, but we have also needed to
> override some SolrJ classes because of bugs. So it is possible that we
> could fix this behavior if the problem is within the code we have changed.
> However, having said that, I am not sure that the differences you report
> are significant in any way. The W3C spec for multipart HTTP requests is
> what you'd want to look at for that.
>
> Please see ModifiedHttpMultipart.java for more details.
>
> Thanks,
> Karl
>
>
> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI <[email protected]>
> wrote:
>
>> Hi Karl,
>>
>> I used default values for Solr. At my Solr output connector "Use the
>> Extract Update Handler" is clicked. The update handler is defined as
>> "/update/extract".
>> There is no Tika content extractor defined at the Job
>> pipeline.
>>
>> I have WireShark captures and logs from both ManifoldCF and Solr. I can
>> share them if you want.
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]> wrote:
>>
>>> Is this being indexed via the extracting update handler? What does your
>>> pipeline look like? Is the Tika extractor in the pipeline?
>>>
>>> Karl
>>>
>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <[email protected]>
>>> wrote:
>>>
>>>> I've indexed a file via ManifoldCF to Solr whose content starts
>>>> with:
>>>>
>>>> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
>>>> directed by Elia Kazan, 1951*
>>>>
>>>> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>> by Elia Kazan, 1951*
>>>>
>>>> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>> by Elia Kazan, 1951*
>>>>
>>>> However, when I check Solr, I see this in the content field:
>>>>
>>>> * " \n \nstream_source_info MARLON BRANDO.rtf \nstream_content_type
>>>> application/rtf \nstream_size 13580 \nstream_name MARLON BRANDO.rtf
>>>> \nContent-Type application/rtf \nresourceName MARLON BRANDO.rtf \n \n
>>>> \n 1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>>>> directed by Elia Kazan \n"*
>>>>
>>>> There are two problems here.
>>>>
>>>> 1) There are unnecessary newline characters.
>>>>
>>>> 2) There is metadata prepended to the content field which should not be
>>>> there.
>>>>
>>>> So, one could think the problem lies either with Solr or with ManifoldCF
>>>> (related to Tika). When I index the same document to Solr via cURL there
>>>> are no newline characters or prepended metadata.
>>>>
>>>> What do you think about a solution?
>>>>
>>>> Kind Regards,
>>>> Furkan KAMACI
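[Editor's note: one thing worth checking on the Solr side, not discussed in this thread. In a stock Solr 6.x solrconfig.xml, the /update/extract handler's defaults control where Tika's extracted body text and metadata land. A sketch of the relevant defaults follows; the field names "content" and the "ignored_" prefix are assumptions to be adjusted to the actual schema. Without an appropriate `fmap.content` mapping and `uprefix`, extracted metadata fields can end up in a catch-all field alongside the body, which matches the prepended-metadata symptom reported above.]

```xml
<!-- Sketch of /update/extract defaults in solrconfig.xml; the target
     field names here ("content", "ignored_") are assumptions. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <!-- Map Tika's extracted body text to the schema's content field -->
    <str name="fmap.content">content</str>
    <!-- Prefix unknown Tika metadata fields so they hit an ignored
         dynamic field instead of polluting indexed content -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```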
