Hi,

If you use the following parameters, you can remove the metadata from the content field value.

$ curl 'http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true&fmap.content=content_t&fmap.uprefix=ignored_&captureAttr=true&fmap.div=ignored_&fmap.a=ignored_' -F "myfile=@/path/to/file.pdf"

Those parameters are in the Solr 4 config, but they have been removed since Solr 5.

Regards,
Shinichiro Abe

2016-11-26 23:55 GMT+09:00 Shinichiro Abe <[email protected]>:
> Hi,
>
>> Everything is OK when you directly send data to Solr without MFC.
> How did you send files?
>
> I just sent a pdf to Solr by curl, metadata is included to the content
> field value.
>
> command:
> $ curl
> 'http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true&fmap.content=content_t'
> -F "myfile=@/path/to/file.pdf"
>
> content field value with metadata:
> "content_t":[" \n \n date 2016-06-27T13:15:05Z \n pdf:PDFVersion 1.4 ...
>
> The content field value indexed by Solr Cell contains the metadata
> strings unless using field mapping in solrconfig.xml.
>
> Shinichiro Abe
>
>
>
> 2016-11-26 21:26 GMT+09:00 Furkan KAMACI <[email protected]>:
>> Hi Shinichiro,
>>
>> Yes, I can see the content with that way. However, beside the new line
>> characters, there is metadata information prepended
>> to content. Everything is OK when you directly send data to Solr without
>> MFC.
>>
>> For example one of my content starts with it:
>>
>> \n \n stream_size 298979 \n pdf:PDFVersion 1.4 \n X-Parsed-By
>> org.apache.tika.parser.DefaultParser \n X-Parsed-By
>> org.apache.tika.parser.pdf.PDFParser \n xmp:CreatorTool Google \n
>> stream_content_type application/pdf \n access_permission:modify_annotations
>> true \n access_permission:can_print_degraded true
>>
>> I am suspicious about that the way that MFC sends data to Solr. Could you
>> also check it?
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> On Sat, Nov 26, 2016 at 2:52 AM, Shinichiro Abe <[email protected]>
>> wrote:
>>>
>>> Hi Furkan,
>>>
>>> Please see the previous mail[1] which may be the same issue.
>>> And as far as I know the new line chars will appear in any Tika
>>> version and you can see by json format in Solr. When you want to
>>> remove that, please use charfilter or updateprocessor in Solr. I think
>>> even when fields have new line chars, searching works, so I don't
>>> think it is mcf's solrj issue.
>>>
>>>
>>> [1]http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201610.mbox/%3CCA%2BeTv_UO5DKgza%2Bo0bVQF_i%2B8wtHdz61gP51XHu2gF3rKLn%2BMg%40mail.gmail.com%3E
>>>
>>> Shinichiro Abe
>>>
>>> 2016-11-26 4:11 GMT+09:00 Karl Wright <[email protected]>:
>>> > I am on vacation today and have other responsibilities. However, I
>>> > believe
>>> > Shinichiro Abe might be able to test this out. He redid the Solr
>>> > integration for SolrJ 6.3.
>>> >
>>> > Thanks,
>>> > Karl
>>> >
>>> >
>>> > On Fri, Nov 25, 2016 at 1:54 PM, Furkan KAMACI <[email protected]>
>>> > wrote:
>>> >>
>>> >> Hi Karl,
>>> >>
>>> >> Could you try to test MFC with Solr? I cannot see content field either
>>> >> with Windows Shares or File System with Solr 4.x, 5.x, 6.x. Only Solr
>>> >> 4.x
>>> >> have content and it is as I defined. Code part of sending content as a
>>> >> stream may have some problems.
>>> >>
>>> >> Kind Regards,
>>> >> Furkan KAMACI
>>> >>
>>> >>
>>> >> On Fri, Nov 25, 2016 at 4:13 PM, Furkan KAMACI <[email protected]>
>>> >> wrote:
>>> >>>
>>> >>> Hi Karl,
>>> >>>
>>> >>> By the way, I've tried different versions of Solr and couldn't get
>>> >>> content or got as I've explained. When I checkout the MFC trunk which
>>> >>> uses
>>> >>> Solr 6.3.0 and when I use Solr 6.3.0 as output connector I can see
>>> >>> documents
>>> >>> are indexed but I cannot even see "content" field.
>>> >>>
>>> >>> Kind Regards,
>>> >>> Furkan KAMACI
>>> >>>
>>> >>> On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <[email protected]>
>>> >>> wrote:
>>> >>>>
>>> >>>> Hi Furkan,
>>> >>>>
>>> >>>> The following code is used to set up a SolrJ object that is then
>>> >>>> later
>>> >>>> converted to a post request:
>>> >>>>
>>> >>>> >>>>>>
>>> >>>> private void buildExtractUpdateHandlerRequest( long length,
>>> >>>>   InputStream is, String contentType,
>>> >>>>   String contentName,
>>> >>>>   ContentStreamUpdateRequest contentStreamUpdateRequest )
>>> >>>>   throws IOException
>>> >>>> {
>>> >>>>   ModifiableSolrParams out = new ModifiableSolrParams();
>>> >>>>
>>> >>>>   // Write the id field
>>> >>>>   writeField(out,LITERAL+idAttributeName,documentURI);
>>> >>>>   // Write the rest of the attributes
>>> >>>>   if (originalSizeAttributeName != null)
>>> >>>>   {
>>> >>>>     Long size = document.getOriginalSize();
>>> >>>>     if (size != null)
>>> >>>>       // Write value
>>> >>>>       writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>>> >>>>   }
>>> >>>>   if (modifiedDateAttributeName != null)
>>> >>>>   {
>>> >>>>     Date date = document.getModifiedDate();
>>> >>>>     if (date != null)
>>> >>>>       // Write value
>>> >>>>       writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>>> >>>>   }
>>> >>>>   if (createdDateAttributeName != null)
>>> >>>>   {
>>> >>>>     Date date = document.getCreatedDate();
>>> >>>>     if (date != null)
>>> >>>>       // Write value
>>> >>>>       writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>>> >>>>   }
>>> >>>>   if (indexedDateAttributeName != null)
>>> >>>>   {
>>> >>>>     Date date = document.getIndexingDate();
>>> >>>>     if (date != null)
>>> >>>>       // Write value
>>> >>>>       writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>>> >>>>   }
>>> >>>>   if (fileNameAttributeName != null)
>>> >>>>   {
>>> >>>>     String fileName = document.getFileName();
>>> >>>>     if (!StringUtils.isBlank(fileName))
>>> >>>>       writeField(out,LITERAL+fileNameAttributeName,fileName);
>>> >>>>   }
>>> >>>>   if (mimeTypeAttributeName != null)
>>> >>>>   {
>>> >>>>     String mimeType = document.getMimeType();
>>> >>>>     if (!StringUtils.isBlank(mimeType))
>>> >>>>       writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>>> >>>>   }
>>> >>>>
>>> >>>>   // Write the access token information
>>> >>>>   // Both maps have the same keys.
>>> >>>>   Iterator<String> typeIterator = aclsMap.keySet().iterator();
>>> >>>>   while (typeIterator.hasNext())
>>> >>>>   {
>>> >>>>     String aclType = typeIterator.next();
>>> >>>>     writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>>> >>>>   }
>>> >>>>
>>> >>>>   // Write the arguments
>>> >>>>   for (String name : arguments.keySet())
>>> >>>>   {
>>> >>>>     List<String> values = arguments.get(name);
>>> >>>>     writeField(out,name,values);
>>> >>>>   }
>>> >>>>
>>> >>>>   // Write the metadata, each in a field by itself
>>> >>>>   buildSolrParamsFromMetadata(out);
>>> >>>>
>>> >>>>   // These are unnecessary now in the case of non-solrcloud setups,
>>> >>>>   // because we overrode the SolrJ posting method to use multipart.
>>> >>>>   //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>>> >>>>   //writeField(out,LITERAL+"stream_name",document.getFileName());
>>> >>>>
>>> >>>>   // General hint for Tika
>>> >>>>   if (!StringUtils.isBlank(document.getFileName()))
>>> >>>>     writeField(out,"resource.name",document.getFileName());
>>> >>>>
>>> >>>>   // Write the commitWithin parameter
>>> >>>>   if (commitWithin != null)
>>> >>>>     writeField(out,COMMITWITHIN_METADATA,commitWithin);
>>> >>>>
>>> >>>>   contentStreamUpdateRequest.setParams(out);
>>> >>>>
>>> >>>>   contentStreamUpdateRequest.addContentStream(new
>>> >>>>     RepositoryDocumentStream(is,length,contentType,contentName));
>>> >>>> }
>>> >>>> <<<<<<
>>> >>>>
>>> >>>> The ContentStreamUpdateRequest object is defined within SolrJ.
>>> >>>> Normally
>>> >>>> this would be the end of ManifoldCF involvement, but we have also
>>> >>>> needed to
>>> >>>> override some SolrJ classes because of bugs. So it is possible that
>>> >>>> we
>>> >>>> could fix this behavior if the problem is within the code we have
>>> >>>> changed.
>>> >>>> However, having said that, I am not sure that the differences you
>>> >>>> report are
>>> >>>> significant in any way. The w3c spec for multipart HTTP requests is
>>> >>>> what
>>> >>>> you'd want to look at for that.
>>> >>>>
>>> >>>> Please see ModifiedHttpMultipart.java for more details.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Karl
>>> >>>>
>>> >>>>
>>> >>>> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI
>>> >>>> <[email protected]>
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> Hi Karl,
>>> >>>>>
>>> >>>>> I used default values for Solr. At my Solr output connector "Use the
>>> >>>>> Extract Update Handler" is clicked. Update handler is defined as:
>>> >>>>> "/update/extract". There is no Tika content extractor defined at Job
>>> >>>>> pipeline.
>>> >>>>>
>>> >>>>> I have WireShark captures and logs from both ManifoldCF and Solr. I
>>> >>>>> can
>>> >>>>> share them if you want.
>>> >>>>>
>>> >>>>> Kind Regards,
>>> >>>>> Furkan KAMACI
>>> >>>>>
>>> >>>>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]>
>>> >>>>> wrote:
>>> >>>>>>
>>> >>>>>> Is this being indexed via the extracting update handler? What does
>>> >>>>>> your pipeline look like? Is the tika extractor in the pipeline?
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Karl
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI
>>> >>>>>> <[email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>> I've indexed a file via ManifoldCF to Solr which has a content
>>> >>>>>>> starts
>>> >>>>>>> with:
>>> >>>>>>>
>>> >>>>>>> 1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
>>> >>>>>>> directed by Elia Kazan, 1951
>>> >>>>>>>
>>> >>>>>>> 2. Portrait of Marlon Brando for "A Streetcar Named Desire"
>>> >>>>>>> directed
>>> >>>>>>> by Elia Kazan, 1951
>>> >>>>>>>
>>> >>>>>>> 3. Portrait of Marlon Brando for "A Streetcar Named Desire"
>>> >>>>>>> directed
>>> >>>>>>> by Elia Kazan, 1951
>>> >>>>>>>
>>> >>>>>>> However when I check Solr I see that at content:
>>> >>>>>>>
>>> >>>>>>> " \n \nstream_source_info MARLON BRANDO.rtf
>>> >>>>>>> \nstream_content_type
>>> >>>>>>> application/rtf \nstream_size 13580 \nstream_name MARLON
>>> >>>>>>> BRANDO.rtf
>>> >>>>>>> \nContent-Type application/rtf \nresourceName MARLON BRANDO.rtf
>>> >>>>>>> \n \n
>>> >>>>>>> \n 1. Vivien Leigh and Marlon Brando in \"A Streetcar Named
>>> >>>>>>> Desire\"
>>> >>>>>>> directed by Elia Kazan \n"
>>> >>>>>>>
>>> >>>>>>> There are 2 problems at here.
>>> >>>>>>>
>>> >>>>>>> 1) There are newline characters which are unnecessary.
>>> >>>>>>>
>>> >>>>>>> 2) There are metadata prepended to content field which should not
>>> >>>>>>> be.
>>> >>>>>>>
>>> >>>>>>> So, one can think that problem maybe at Solr or ManifoldCF
>>> >>>>>>> (related
>>> >>>>>>> to Tika). When I index same document to Solr via cURL there are
>>> >>>>>>> not new line
>>> >>>>>>> characters or metadata prepended.
>>> >>>>>>>
>>> >>>>>>> What do you think about for a solution?
>>> >>>>>>>
>>> >>>>>>> Kind Regards,
>>> >>>>>>> Furkan KAMACI
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>
>>
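
For reference, the field mapping passed on the curl URL at the top of this mail could also be set once as handler defaults in solrconfig.xml, so that clients such as ManifoldCF pick it up without sending extra arguments. The fragment below is only a sketch under assumptions: it assumes the stock ExtractingRequestHandler registered at /update/extract, it reuses the names content_t and ignored_ from the curl command (both must match fields or dynamic fields in your schema), and it writes the prefix parameter as uprefix, which is how the Solr 4 example config spells it, rather than fmap.uprefix as in the command above.

```xml
<!-- Sketch only: adjust field names to your own schema. -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Extracted body text goes to content_t (name taken from the curl example) -->
    <str name="fmap.content">content_t</str>
    <!-- Fields not defined in the schema get this prefix; assumes an ignored_* dynamic field exists -->
    <str name="uprefix">ignored_</str>
    <!-- Index XHTML element attributes into separate fields instead of the content -->
    <str name="captureAttr">true</str>
    <str name="fmap.div">ignored_</str>
    <str name="fmap.a">ignored_</str>
  </lst>
</requestHandler>
```

Parameters sent on the request still override these defaults, so the curl command keeps working unchanged.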
