Hi,

> Everything is OK when you directly send data to Solr without MFC.

How did you send the files?
I just sent a PDF to Solr with curl, and the metadata is included in the content field value.

command:
$ curl 'http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true&fmap.content=content_t' -F "myfile=@/path/to/file.pdf"

content field value with metadata:
"content_t":[" \n \n date 2016-06-27T13:15:05Z \n pdf:PDFVersion 1.4 ...

The content field value indexed by Solr Cell contains the metadata strings unless you use field mapping in solrconfig.xml.

Shinichiro Abe

2016-11-26 21:26 GMT+09:00 Furkan KAMACI <[email protected]>:
> Hi Shinichiro,
>
> Yes, I can see the content with that way. However, beside the new line
> characters, there is metadata information prepended to content.
> Everything is OK when you directly send data to Solr without MFC.
>
> For example one of my content starts with it:
>
> \n \n stream_size 298979 \n pdf:PDFVersion 1.4 \n X-Parsed-By
> org.apache.tika.parser.DefaultParser \n X-Parsed-By
> org.apache.tika.parser.pdf.PDFParser \n xmp:CreatorTool Google \n
> stream_content_type application/pdf \n access_permission:modify_annotations
> true \n access_permission:can_print_degraded true
>
> I am suspicious about that the way that MFC sends data to Solr. Could you
> also check it?
>
> Kind Regards,
> Furkan KAMACI
>
> On Sat, Nov 26, 2016 at 2:52 AM, Shinichiro Abe <[email protected]>
> wrote:
>>
>> Hi Furkan,
>>
>> Please see the previous mail[1] which may be the same issue.
>> And as far as I know the new line chars will appear in any Tika
>> version and you can see by json format in Solr. When you want to
>> remove that, please use charfilter or updateprocessor in Solr. I think
>> even when fields have new line chars, searching works, so I don't
>> think it is mcf's solrj issue.
>>
>>
>> [1]http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201610.mbox/%3CCA%2BeTv_UO5DKgza%2Bo0bVQF_i%2B8wtHdz61gP51XHu2gF3rKLn%2BMg%40mail.gmail.com%3E
>>
>> Shinichiro Abe
>>
>> 2016-11-26 4:11 GMT+09:00 Karl Wright <[email protected]>:
>> > I am on vacation today and have other responsibilities. However, I believe
>> > Shinichiro Abe might be able to test this out. He redid the Solr
>> > integration for SolrJ 6.3.
>> >
>> > Thanks,
>> > Karl
>> >
>> >
>> > On Fri, Nov 25, 2016 at 1:54 PM, Furkan KAMACI <[email protected]>
>> > wrote:
>> >>
>> >> Hi Karl,
>> >>
>> >> Could you try to test MFC with Solr? I cannot see content field either
>> >> with Windows Shares or File System with Solr 4.x, 5.x, 6.x. Only Solr 4.x
>> >> have content and it is as I defined. Code part of sending content as a
>> >> stream may have some problems.
>> >>
>> >> Kind Regards,
>> >> Furkan KAMACI
>> >>
>> >>
>> >> On Fri, Nov 25, 2016 at 4:13 PM, Furkan KAMACI <[email protected]>
>> >> wrote:
>> >>>
>> >>> Hi Karl,
>> >>>
>> >>> By the way, I've tried different versions of Solr and couldn't get
>> >>> content or got as I've explained. When I checkout the MFC trunk which uses
>> >>> Solr 6.3.0 and when I use Solr 6.3.0 as output connector I can see documents
>> >>> are indexed but I cannot even see "content" field.
>> >>>
>> >>> Kind Regards,
>> >>> Furkan KAMACI
>> >>>
>> >>> On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> Hi Furkan,
>> >>>>
>> >>>> The following code is used to set up a SolrJ object that is then later
>> >>>> converted to a post request:
>> >>>>
>> >>>> >>>>>>
>> >>>>   private void buildExtractUpdateHandlerRequest( long length, InputStream is, String contentType,
>> >>>>     String contentName,
>> >>>>     ContentStreamUpdateRequest contentStreamUpdateRequest )
>> >>>>     throws IOException
>> >>>>   {
>> >>>>     ModifiableSolrParams out = new ModifiableSolrParams();
>> >>>>
>> >>>>     // Write the id field
>> >>>>     writeField(out,LITERAL+idAttributeName,documentURI);
>> >>>>     // Write the rest of the attributes
>> >>>>     if (originalSizeAttributeName != null)
>> >>>>     {
>> >>>>       Long size = document.getOriginalSize();
>> >>>>       if (size != null)
>> >>>>         // Write value
>> >>>>         writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>> >>>>     }
>> >>>>     if (modifiedDateAttributeName != null)
>> >>>>     {
>> >>>>       Date date = document.getModifiedDate();
>> >>>>       if (date != null)
>> >>>>         // Write value
>> >>>>         writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>> >>>>     }
>> >>>>     if (createdDateAttributeName != null)
>> >>>>     {
>> >>>>       Date date = document.getCreatedDate();
>> >>>>       if (date != null)
>> >>>>         // Write value
>> >>>>         writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>> >>>>     }
>> >>>>     if (indexedDateAttributeName != null)
>> >>>>     {
>> >>>>       Date date = document.getIndexingDate();
>> >>>>       if (date != null)
>> >>>>         // Write value
>> >>>>         writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>> >>>>     }
>> >>>>     if (fileNameAttributeName != null)
>> >>>>     {
>> >>>>       String fileName = document.getFileName();
>> >>>>       if (!StringUtils.isBlank(fileName))
>> >>>>         writeField(out,LITERAL+fileNameAttributeName,fileName);
>> >>>>     }
>> >>>>     if (mimeTypeAttributeName != null)
>> >>>>     {
>> >>>>       String mimeType = document.getMimeType();
>> >>>>       if (!StringUtils.isBlank(mimeType))
>> >>>>         writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>> >>>>     }
>> >>>>
>> >>>>     // Write the access token information
>> >>>>     // Both maps have the same keys.
>> >>>>     Iterator<String> typeIterator = aclsMap.keySet().iterator();
>> >>>>     while (typeIterator.hasNext())
>> >>>>     {
>> >>>>       String aclType = typeIterator.next();
>> >>>>       writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>> >>>>     }
>> >>>>
>> >>>>     // Write the arguments
>> >>>>     for (String name : arguments.keySet())
>> >>>>     {
>> >>>>       List<String> values = arguments.get(name);
>> >>>>       writeField(out,name,values);
>> >>>>     }
>> >>>>
>> >>>>     // Write the metadata, each in a field by itself
>> >>>>     buildSolrParamsFromMetadata(out);
>> >>>>
>> >>>>     // These are unnecessary now in the case of non-solrcloud setups,
>> >>>>     // because we overrode the SolrJ posting method to use multipart.
>> >>>>     //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>> >>>>     //writeField(out,LITERAL+"stream_name",document.getFileName());
>> >>>>
>> >>>>     // General hint for Tika
>> >>>>     if (!StringUtils.isBlank(document.getFileName()))
>> >>>>       writeField(out,"resource.name",document.getFileName());
>> >>>>
>> >>>>     // Write the commitWithin parameter
>> >>>>     if (commitWithin != null)
>> >>>>       writeField(out,COMMITWITHIN_METADATA,commitWithin);
>> >>>>
>> >>>>     contentStreamUpdateRequest.setParams(out);
>> >>>>
>> >>>>     contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length,contentType,contentName));
>> >>>>   }
>> >>>> <<<<<<
>> >>>>
>> >>>> The ContentStreamUpdateRequest object is defined within SolrJ.
>> >>>> Normally this would be the end of ManifoldCF involvement, but we have also
>> >>>> needed to override some SolrJ classes because of bugs. So it is possible that
>> >>>> we could fix this behavior if the problem is within the code we have changed.
>> >>>> However, having said that, I am not sure that the differences you report are
>> >>>> significant in any way. The w3c spec for multipart HTTP requests is what
>> >>>> you'd want to look at for that.
>> >>>>
>> >>>> Please see ModifiedHttpMultipart.java for more details.
>> >>>>
>> >>>> Thanks,
>> >>>> Karl
>> >>>>
>> >>>>
>> >>>> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI <[email protected]>
>> >>>> wrote:
>> >>>>>
>> >>>>> Hi Karl,
>> >>>>>
>> >>>>> I used default values for Solr. At my Solr output connector "Use the
>> >>>>> Extract Update Handler" is clicked. Update handler is defined as:
>> >>>>> "/update/extract". There is no Tika content extractor defined at Job
>> >>>>> pipeline.
>> >>>>>
>> >>>>> I have WireShark captures and logs from both ManifoldCF and Solr. I can
>> >>>>> share them if you want.
>> >>>>>
>> >>>>> Kind Regards,
>> >>>>> Furkan KAMACI
>> >>>>>
>> >>>>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> Is this being indexed via the extracting update handler? What does
>> >>>>>> your pipeline look like? Is the tika extractor in the pipeline?
>> >>>>>>
>> >>>>>>
>> >>>>>> Karl
>> >>>>>>
>> >>>>>>
>> >>>>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI
>> >>>>>> <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>> I've indexed a file via ManifoldCF to Solr which has a content starts
>> >>>>>>> with:
>> >>>>>>>
>> >>>>>>> 1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
>> >>>>>>> directed by Elia Kazan, 1951
>> >>>>>>>
>> >>>>>>> 2. Portrait of Marlon Brando for "A Streetcar Named Desire"
>> >>>>>>> directed by Elia Kazan, 1951
>> >>>>>>>
>> >>>>>>> 3. Portrait of Marlon Brando for "A Streetcar Named Desire"
>> >>>>>>> directed by Elia Kazan, 1951
>> >>>>>>>
>> >>>>>>> However when I check Solr I see that at content:
>> >>>>>>>
>> >>>>>>> " \n \nstream_source_info MARLON BRANDO.rtf \nstream_content_type
>> >>>>>>> application/rtf \nstream_size 13580 \nstream_name MARLON BRANDO.rtf
>> >>>>>>> \nContent-Type application/rtf \nresourceName MARLON BRANDO.rtf \n \n
>> >>>>>>> \n 1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>> >>>>>>> directed by Elia Kazan \n"
>> >>>>>>>
>> >>>>>>> There are 2 problems at here.
>> >>>>>>>
>> >>>>>>> 1) There are newline characters which are unnecessary.
>> >>>>>>>
>> >>>>>>> 2) There are metadata prepended to content field which should not be.
>> >>>>>>>
>> >>>>>>> So, one can think that problem maybe at Solr or ManifoldCF (related
>> >>>>>>> to Tika). When I index same document to Solr via cURL there are
>> >>>>>>> not new line characters or metadata prepended.
>> >>>>>>>
>> >>>>>>> What do you think about for a solution?
>> >>>>>>>
>> >>>>>>> Kind Regards,
>> >>>>>>> Furkan KAMACI
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>
>
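
For the first problem listed in the original report (the unnecessary newline characters), the thread suggests a charfilter or updateprocessor inside Solr. As a rough illustration of what such a regex-based replacement does, here is a minimal plain-Java sketch that collapses whitespace runs in a string shaped like the content value quoted above. The class name ContentWhitespace, the method collapse, and the `\s+` pattern are assumptions made for this example; they are not part of ManifoldCF, SolrJ, or Solr. It does nothing about the prepended metadata, which is a separate issue.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not ManifoldCF or Solr code.
class ContentWhitespace {
    // Any run of whitespace (spaces, tabs, line breaks) is treated as one separator.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    /** Collapses each whitespace run to a single space and trims both ends. */
    static String collapse(String raw) {
        if (raw == null) {
            return "";
        }
        return WHITESPACE_RUN.matcher(raw).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Shortened sample modelled on the content value quoted in the thread.
        String sample = " \n \nstream_size 13580 \nstream_name MARLON BRANDO.rtf \n \n \n 1. Vivien Leigh";
        System.out.println(collapse(sample));
    }
}
```

Doing the equivalent inside Solr keeps the cleanup in one place for every client. Note that a char filter only changes what gets analyzed and indexed, while an update processor can also change the stored value that shows up in the JSON response.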
