I am on vacation today and have other responsibilities. However, I believe Shinichiro Abe might be able to test this out. He redid the Solr integration for SolrJ 6.3.
Thanks,
Karl

On Fri, Nov 25, 2016 at 1:54 PM, Furkan KAMACI <[email protected]> wrote:

> Hi Karl,
>
> Could you try to test MFC with Solr? I cannot see content field either
> with Windows Shares or File System with Solr 4.x, 5.x, 6.x. Only Solr 4.x
> have content and it is as I defined. Code part of sending content as a
> stream may have some problems.
>
> Kind Regards,
> Furkan KAMACI
>
>
> On Fri, Nov 25, 2016 at 4:13 PM, Furkan KAMACI <[email protected]>
> wrote:
>
>> Hi Karl,
>>
>> By the way, I've tried different versions of Solr and couldn't get
>> content or got as I've explained. When I checkout the MFC trunk which uses
>> Solr 6.3.0 and when I use Solr 6.3.0 as output connector I can see
>> documents are indexed but I cannot even see "content" field.
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Furkan,
>>>
>>> The following code is used to set up a SolrJ object that is then later
>>> converted to a post request:
>>>
>>> >>>>>>
>>> private void buildExtractUpdateHandlerRequest( long length,
>>>   InputStream is, String contentType,
>>>   String contentName,
>>>   ContentStreamUpdateRequest contentStreamUpdateRequest )
>>>   throws IOException
>>> {
>>>   ModifiableSolrParams out = new ModifiableSolrParams();
>>>
>>>   // Write the id field
>>>   writeField(out,LITERAL+idAttributeName,documentURI);
>>>   // Write the rest of the attributes
>>>   if (originalSizeAttributeName != null)
>>>   {
>>>     Long size = document.getOriginalSize();
>>>     if (size != null)
>>>       // Write value
>>>       writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>>>   }
>>>   if (modifiedDateAttributeName != null)
>>>   {
>>>     Date date = document.getModifiedDate();
>>>     if (date != null)
>>>       // Write value
>>>       writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>>>   }
>>>   if (createdDateAttributeName != null)
>>>   {
>>>     Date date = document.getCreatedDate();
>>>     if (date != null)
>>>       // Write value
>>>       writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>>>   }
>>>   if (indexedDateAttributeName != null)
>>>   {
>>>     Date date = document.getIndexingDate();
>>>     if (date != null)
>>>       // Write value
>>>       writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>>>   }
>>>   if (fileNameAttributeName != null)
>>>   {
>>>     String fileName = document.getFileName();
>>>     if (!StringUtils.isBlank(fileName))
>>>       writeField(out,LITERAL+fileNameAttributeName,fileName);
>>>   }
>>>   if (mimeTypeAttributeName != null)
>>>   {
>>>     String mimeType = document.getMimeType();
>>>     if (!StringUtils.isBlank(mimeType))
>>>       writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>>>   }
>>>
>>>   // Write the access token information
>>>   // Both maps have the same keys.
>>>   Iterator<String> typeIterator = aclsMap.keySet().iterator();
>>>   while (typeIterator.hasNext())
>>>   {
>>>     String aclType = typeIterator.next();
>>>     writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>>>   }
>>>
>>>   // Write the arguments
>>>   for (String name : arguments.keySet())
>>>   {
>>>     List<String> values = arguments.get(name);
>>>     writeField(out,name,values);
>>>   }
>>>
>>>   // Write the metadata, each in a field by itself
>>>   buildSolrParamsFromMetadata(out);
>>>
>>>   // These are unnecessary now in the case of non-solrcloud setups,
>>>   // because we overrode the SolrJ posting method to use multipart.
>>>   //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>>>   //writeField(out,LITERAL+"stream_name",document.getFileName());
>>>
>>>   // General hint for Tika
>>>   if (!StringUtils.isBlank(document.getFileName()))
>>>     writeField(out,"resource.name",document.getFileName());
>>>
>>>   // Write the commitWithin parameter
>>>   if (commitWithin != null)
>>>     writeField(out,COMMITWITHIN_METADATA,commitWithin);
>>>
>>>   contentStreamUpdateRequest.setParams(out);
>>>
>>>   contentStreamUpdateRequest.addContentStream(new
>>>     RepositoryDocumentStream(is,length,contentType,contentName));
>>> }
>>> <<<<<<
>>>
>>> The ContentStreamUpdateRequest object is defined within SolrJ. Normally
>>> this would be the end of ManifoldCF involvement, but we have also needed to
>>> override some SolrJ classes because of bugs. So it is possible that we
>>> could fix this behavior if the problem is within the code we have changed.
>>> However, having said that, I am not sure that the differences you report
>>> are significant in any way. The w3c spec for multipart HTTP requests is
>>> what you'd want to look at for that.
>>>
>>> Please see ModifiedHttpMultipart.java for more details.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI <[email protected]>
>>> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> I used default values for Solr. At my Solr output connector "Use the
>>>> Extract Update Handler" is clicked. Update handler is defined as:
>>>> "/update/extract". There is no Tika content extractor defined at Job
>>>> pipeline.
>>>>
>>>> I have WireShark captures and logs from both ManifoldCF and Solr. I can
>>>> share them if you want.
>>>>
>>>> Kind Regards,
>>>> Furkan KAMACI
>>>>
>>>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Is this being indexed via the extracting update handler? What does
>>>>> your pipeline look like? Is the tika extractor in the pipeline?
>>>>>
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I've indexed a file via ManifoldCF to Solr which has a content starts
>>>>>> with:
>>>>>>
>>>>>> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
>>>>>> directed by Elia Kazan, 1951*
>>>>>>
>>>>>> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>>>> by Elia Kazan, 1951*
>>>>>>
>>>>>> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>>>> by Elia Kazan, 1951*
>>>>>>
>>>>>> However when I check Solr I see that at content:
>>>>>>
>>>>>> * " \n \nstream_source_info MARLON BRANDO.rtf \nstream_content_type
>>>>>> application/rtf \nstream_size 13580 \nstream_name MARLON BRANDO.rtf
>>>>>> \nContent-Type application/rtf \nresourceName MARLON BRANDO.rtf \n
>>>>>> \n
>>>>>> \n 1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>>>>>> directed by Elia Kazan \n"*
>>>>>>
>>>>>> There are 2 problems at here.
>>>>>>
>>>>>> 1) There are newline characters which are unnecessary.
>>>>>>
>>>>>> 2) There are metadata prepended to content field which should not be.
>>>>>>
>>>>>> So, one can think that problem maybe at Solr or ManifoldCF (related
>>>>>> to Tika). When I index same document to Solr via cURL there are not new
>>>>>> line characters or metadata prepended.
>>>>>>
>>>>>> What do you think about for a solution?
>>>>>>
>>>>>> Kind Regards,
>>>>>> Furkan KAMACI
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
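
For anyone comparing the ManifoldCF post against the cURL one, here is a rough, stand-alone sketch of the kind of request parameters the quoted method sets for the extract handler. It is not ManifoldCF or SolrJ code. The class and method names are made up for illustration, the id field name "id" and the value "literal." for the LITERAL constant are assumptions, and the parameters are rendered as a query string (as one would pass them with cURL) rather than as the multipart post ManifoldCF builds.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a hypothetical helper, not part of ManifoldCF or SolrJ.
final class ExtractParamsSketch {
    // Assumption: LITERAL in the quoted method holds Solr's "literal." prefix.
    static final String LITERAL = "literal.";

    // Collects the id, the Tika file-name hint, and commitWithin,
    // skipping values that are unset, as the quoted method does.
    static Map<String, String> buildParams(String documentUri, String fileName, String commitWithin) {
        Map<String, String> out = new LinkedHashMap<>();
        // Assumption: the id attribute is named "id".
        out.put(LITERAL + "id", documentUri);
        if (fileName != null && !fileName.trim().isEmpty()) {
            out.put("resource.name", fileName);
        }
        if (commitWithin != null) {
            out.put("commitWithin", commitWithin);
        }
        return out;
    }

    // Renders the parameters as a URL-encoded query string, in insertion order.
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = buildParams("doc1", "MARLON BRANDO.rtf", "1000");
        System.out.println("/update/extract?" + toQueryString(params));
    }
}
```

Comparing a parameter list like this with what appears in the WireShark capture may help narrow down whether the two posts differ in parameters or only in how the content stream is sent.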
