Thanks Shinichiro! Changing solrconfig.xml as you suggested resolved it!
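
For the archives, the change boils down to moving those extract parameters into the /update/extract handler defaults in solrconfig.xml. A rough sketch only (content_t matches the example below, and an ignored_* dynamic field has to exist in the schema, as in the Solr 4 example configs):

  <requestHandler name="/update/extract" startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <!-- Put Tika's extracted body text into content_t rather than "content". -->
      <str name="fmap.content">content_t</str>
      <!-- Prefix metadata fields that are not defined in the schema, so they are
           caught by the ignored_* dynamic field instead of being glued onto the content. -->
      <str name="uprefix">ignored_</str>
      <str name="captureAttr">true</str>
      <str name="fmap.div">ignored_</str>
      <str name="fmap.a">ignored_</str>
    </lst>
  </requestHandler>

The Solr Cell contrib jars still have to be on the classpath (the stock configs load them with <lib> directives), and the schema needs something like <dynamicField name="ignored_*" type="ignored" multiValued="true"/> so the mapped-away fields are silently dropped. A sketch of the update-processor option for the leftover newline characters is at the bottom of this mail.
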
On Sat, Nov 26, 2016 at 5:23 PM, Shinichiro Abe <[email protected]> wrote:
> Hi,
>
> If you use the following parameters, you can remove the metadata info
> from the content field value.
>
> $ curl 'http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true&fmap.content=content_t&fmap.uprefix=ignored_&captureAttr=true&fmap.div=ignored_&fmap.a=ignored_' -F "myfile=@/path/to/file.pdf"
>
> Those parameters are in the Solr 4 config, but they have been removed since Solr 5.
>
> Regards,
> Shinichiro Abe
>
> 2016-11-26 23:55 GMT+09:00 Shinichiro Abe <[email protected]>:
>> Hi,
>>
>>> Everything is OK when you directly send data to Solr without MFC.
>> How did you send the files?
>>
>> I just sent a PDF to Solr with curl, and the metadata is included in the content field value.
>>
>> Command:
>> $ curl 'http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true&fmap.content=content_t' -F "myfile=@/path/to/file.pdf"
>>
>> Content field value with metadata:
>> "content_t":[" \n \n date 2016-06-27T13:15:05Z \n pdf:PDFVersion 1.4 ...
>>
>> The content field value indexed by Solr Cell contains the metadata strings unless field mapping is used in solrconfig.xml.
>>
>> Shinichiro Abe
>>
>> 2016-11-26 21:26 GMT+09:00 Furkan KAMACI <[email protected]>:
>>> Hi Shinichiro,
>>>
>>> Yes, I can see the content that way. However, besides the newline characters, there is metadata information prepended to the content. Everything is OK when you directly send data to Solr without MFC.
>>>
>>> For example, one of my contents starts with:
>>>
>>> \n \n stream_size 298979 \n pdf:PDFVersion 1.4 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n X-Parsed-By org.apache.tika.parser.pdf.PDFParser \n xmp:CreatorTool Google \n stream_content_type application/pdf \n access_permission:modify_annotations true \n access_permission:can_print_degraded true
>>>
>>> I am suspicious about the way MFC sends data to Solr. Could you also check it?
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>> On Sat, Nov 26, 2016 at 2:52 AM, Shinichiro Abe <[email protected]> wrote:
>>>> Hi Furkan,
>>>>
>>>> Please see the previous mail [1], which may be the same issue.
>>>> As far as I know, the newline chars will appear with any Tika version, and you can see them in the JSON output in Solr. If you want to remove them, please use a charfilter or an update processor in Solr. I think searching works even when fields contain newline chars, so I don't think it is MCF's SolrJ issue.
>>>>
>>>> [1] http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201610.mbox/%3CCA%2BeTv_UO5DKgza%2Bo0bVQF_i%2B8wtHdz61gP51XHu2gF3rKLn%2BMg%40mail.gmail.com%3E
>>>>
>>>> Shinichiro Abe
>>>>
>>>> 2016-11-26 4:11 GMT+09:00 Karl Wright <[email protected]>:
>>>>> I am on vacation today and have other responsibilities. However, I believe Shinichiro Abe might be able to test this out. He redid the Solr integration for SolrJ 6.3.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>> On Fri, Nov 25, 2016 at 1:54 PM, Furkan KAMACI <[email protected]> wrote:
>>>>>> Hi Karl,
>>>>>>
>>>>>> Could you try to test MFC with Solr? I cannot see the content field with either Windows Shares or File System on Solr 4.x, 5.x, or 6.x. Only Solr 4.x has content, and it is as I defined. The code that sends the content as a stream may have some problems.
>>>>>>
>>>>>> Kind Regards,
>>>>>> Furkan KAMACI
>>>>>>
>>>>>> On Fri, Nov 25, 2016 at 4:13 PM, Furkan KAMACI <[email protected]> wrote:
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> By the way, I've tried different versions of Solr and either couldn't get content at all or got it as I've explained. When I check out the MFC trunk, which uses Solr 6.3.0, and use Solr 6.3.0 as the output connector, I can see that documents are indexed, but I cannot even see a "content" field.
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>> Furkan KAMACI
>>>>>>>
>>>>>>> On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <[email protected]> wrote:
>>>>>>>> Hi Furkan,
>>>>>>>>
>>>>>>>> The following code is used to set up a SolrJ object that is then later converted to a post request:
>>>>>>>>
>>>>>>>> >>>>>>
>>>>>>>> private void buildExtractUpdateHandlerRequest( long length, InputStream is,
>>>>>>>>   String contentType, String contentName,
>>>>>>>>   ContentStreamUpdateRequest contentStreamUpdateRequest )
>>>>>>>>   throws IOException
>>>>>>>> {
>>>>>>>>   ModifiableSolrParams out = new ModifiableSolrParams();
>>>>>>>>
>>>>>>>>   // Write the id field
>>>>>>>>   writeField(out,LITERAL+idAttributeName,documentURI);
>>>>>>>>   // Write the rest of the attributes
>>>>>>>>   if (originalSizeAttributeName != null)
>>>>>>>>   {
>>>>>>>>     Long size = document.getOriginalSize();
>>>>>>>>     if (size != null)
>>>>>>>>       // Write value
>>>>>>>>       writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>>>>>>>>   }
>>>>>>>>   if (modifiedDateAttributeName != null)
>>>>>>>>   {
>>>>>>>>     Date date = document.getModifiedDate();
>>>>>>>>     if (date != null)
>>>>>>>>       // Write value
>>>>>>>>       writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>>>>>>>>   }
>>>>>>>>   if (createdDateAttributeName != null)
>>>>>>>>   {
>>>>>>>>     Date date = document.getCreatedDate();
>>>>>>>>     if (date != null)
>>>>>>>>       // Write value
>>>>>>>>       writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>>>>>>>>   }
>>>>>>>>   if (indexedDateAttributeName != null)
>>>>>>>>   {
>>>>>>>>     Date date = document.getIndexingDate();
>>>>>>>>     if (date != null)
>>>>>>>>       // Write value
>>>>>>>>       writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>>>>>>>>   }
>>>>>>>>   if (fileNameAttributeName != null)
>>>>>>>>   {
>>>>>>>>     String fileName = document.getFileName();
>>>>>>>>     if (!StringUtils.isBlank(fileName))
>>>>>>>>       writeField(out,LITERAL+fileNameAttributeName,fileName);
>>>>>>>>   }
>>>>>>>>   if (mimeTypeAttributeName != null)
>>>>>>>>   {
>>>>>>>>     String mimeType = document.getMimeType();
>>>>>>>>     if (!StringUtils.isBlank(mimeType))
>>>>>>>>       writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Write the access token information
>>>>>>>>   // Both maps have the same keys.
>>>>>>>>   Iterator<String> typeIterator = aclsMap.keySet().iterator();
>>>>>>>>   while (typeIterator.hasNext())
>>>>>>>>   {
>>>>>>>>     String aclType = typeIterator.next();
>>>>>>>>     writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Write the arguments
>>>>>>>>   for (String name : arguments.keySet())
>>>>>>>>   {
>>>>>>>>     List<String> values = arguments.get(name);
>>>>>>>>     writeField(out,name,values);
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Write the metadata, each in a field by itself
>>>>>>>>   buildSolrParamsFromMetadata(out);
>>>>>>>>
>>>>>>>>   // These are unnecessary now in the case of non-solrcloud setups,
>>>>>>>>   // because we overrode the SolrJ posting method to use multipart.
>>>>>>>>   //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>>>>>>>>   //writeField(out,LITERAL+"stream_name",document.getFileName());
>>>>>>>>
>>>>>>>>   // General hint for Tika
>>>>>>>>   if (!StringUtils.isBlank(document.getFileName()))
>>>>>>>>     writeField(out,"resource.name",document.getFileName());
>>>>>>>>
>>>>>>>>   // Write the commitWithin parameter
>>>>>>>>   if (commitWithin != null)
>>>>>>>>     writeField(out,COMMITWITHIN_METADATA,commitWithin);
>>>>>>>>
>>>>>>>>   contentStreamUpdateRequest.setParams(out);
>>>>>>>>
>>>>>>>>   contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length,contentType,contentName));
>>>>>>>> }
>>>>>>>> <<<<<<
>>>>>>>>
>>>>>>>> The ContentStreamUpdateRequest object is defined within SolrJ. Normally this would be the end of ManifoldCF involvement, but we have also needed to override some SolrJ classes because of bugs. So it is possible that we could fix this behavior if the problem is within the code we have changed. However, having said that, I am not sure that the differences you report are significant in any way. The W3C spec for multipart HTTP requests is what you'd want to look at for that.
>>>>>>>>
>>>>>>>> Please see ModifiedHttpMultipart.java for more details.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI <[email protected]> wrote:
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> I used the default values for Solr. In my Solr output connector, "Use the Extract Update Handler" is checked, and the update handler is defined as "/update/extract". There is no Tika content extractor defined in the job pipeline.
>>>>>>>>>
>>>>>>>>> I have Wireshark captures and logs from both ManifoldCF and Solr. I can share them if you want.
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>> Furkan KAMACI
>>>>>>>>>
>>>>>>>>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]> wrote:
>>>>>>>>>> Is this being indexed via the extracting update handler? What does your pipeline look like? Is the Tika extractor in the pipeline?
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <[email protected]> wrote:
>>>>>>>>>>> I've indexed a file via ManifoldCF into Solr whose content starts with:
>>>>>>>>>>>
>>>>>>>>>>> 1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" directed by Elia Kazan, 1951
>>>>>>>>>>>
>>>>>>>>>>> 2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by Elia Kazan, 1951
>>>>>>>>>>>
>>>>>>>>>>> 3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by Elia Kazan, 1951
>>>>>>>>>>>
>>>>>>>>>>> However, when I check Solr, I see this in the content field:
>>>>>>>>>>>
>>>>>>>>>>> " \n \nstream_source_info MARLON BRANDO.rtf \nstream_content_type application/rtf \nstream_size 13580 \nstream_name MARLON BRANDO.rtf \nContent-Type application/rtf \nresourceName MARLON BRANDO.rtf \n \n \n 1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\" directed by Elia Kazan \n"
>>>>>>>>>>>
>>>>>>>>>>> There are two problems here:
>>>>>>>>>>>
>>>>>>>>>>> 1) There are newline characters that are unnecessary.
>>>>>>>>>>>
>>>>>>>>>>> 2) There is metadata prepended to the content field that should not be there.
>>>>>>>>>>>
>>>>>>>>>>> So one might think the problem is in Solr or in ManifoldCF (related to Tika). When I index the same document into Solr via cURL, there are no newline characters and no prepended metadata.
>>>>>>>>>>>
>>>>>>>>>>> What do you think about a solution?
>>>>>>>>>>>
>>>>>>>>>>> Kind Regards,
>>>>>>>>>>> Furkan KAMACI
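
Coming back to the leftover newline characters Shinichiro mentions above: the update-processor route changes the stored value itself, whereas a charfilter only affects the analyzed tokens, so the \n would still show up in search results. A rough sketch for solrconfig.xml, assuming the extracted text lands in content_t as in the examples above (the chain name is only illustrative):

  <updateRequestProcessorChain name="strip-newlines">
    <!-- Collapse runs of whitespace, including newlines, in the extracted content. -->
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">content_t</str>
      <str name="pattern">\s+</str>
      <str name="replacement"> </str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

The chain then has to be selected for the extract handler, e.g. by adding <str name="update.chain">strip-newlines</str> to the /update/extract defaults shown above.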
