Thanks Shinichiro! Changing solrconfig.xml as you suggested resolved it!
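
For the archives, the change boils down to moving those extract parameters into the /update/extract handler defaults in solrconfig.xml. A rough sketch only (content_t matches the example below, and an ignored_* dynamic field has to exist in the schema, as in the Solr 4 example configs):

  <requestHandler name="/update/extract" startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <!-- Put Tika's extracted body text into content_t rather than "content". -->
      <str name="fmap.content">content_t</str>
      <!-- Prefix metadata fields that are not defined in the schema, so they are
           caught by the ignored_* dynamic field instead of being glued onto the content. -->
      <str name="uprefix">ignored_</str>
      <str name="captureAttr">true</str>
      <str name="fmap.div">ignored_</str>
      <str name="fmap.a">ignored_</str>
    </lst>
  </requestHandler>

The Solr Cell contrib jars still have to be on the classpath (the stock configs load them with <lib> directives), and the schema needs something like <dynamicField name="ignored_*" type="ignored" multiValued="true"/> so the mapped-away fields are silently dropped. A sketch of the update-processor option for the leftover newline characters is at the bottom of this mail.
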
On Sat, Nov 26, 2016 at 5:23 PM, Shinichiro Abe <[email protected]> wrote:
> Hi,
>
> If you use the following parameters, you can remove the metadata info
> from the content field value.
>
> $ curl 'http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true&fmap.content=content_t&fmap.uprefix=ignored_&captureAttr=true&fmap.div=ignored_&fmap.a=ignored_' -F "myfile=@/path/to/file.pdf"
>
> Those parameters are in the Solr 4 config, but they have been removed since Solr 5.
>
> Regards,
> Shinichiro Abe
>
> 2016-11-26 23:55 GMT+09:00 Shinichiro Abe <[email protected]>:
>> Hi,
>>
>>> Everything is OK when you directly send data to Solr without MFC.
>> How did you send the files?
>>
>> I just sent a PDF to Solr with curl, and the metadata is included in the content field value.
>>
>> Command:
>> $ curl 'http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true&fmap.content=content_t' -F "myfile=@/path/to/file.pdf"
>>
>> Content field value with metadata:
>> "content_t":[" \n \n date 2016-06-27T13:15:05Z \n pdf:PDFVersion 1.4 ...
>>
>> The content field value indexed by Solr Cell contains the metadata strings unless field mapping is used in solrconfig.xml.
>>
>> Shinichiro Abe
>>
>> 2016-11-26 21:26 GMT+09:00 Furkan KAMACI <[email protected]>:
>>> Hi Shinichiro,
>>>
>>> Yes, I can see the content that way. However, besides the newline characters, there is metadata information prepended to the content. Everything is OK when you directly send data to Solr without MFC.
>>>
>>> For example, one of my contents starts with:
>>>
>>> \n \n stream_size 298979 \n pdf:PDFVersion 1.4 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n X-Parsed-By org.apache.tika.parser.pdf.PDFParser \n xmp:CreatorTool Google \n stream_content_type application/pdf \n access_permission:modify_annotations true \n access_permission:can_print_degraded true
>>>
>>> I am suspicious about the way MFC sends data to Solr. Could you also check it?
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>> On Sat, Nov 26, 2016 at 2:52 AM, Shinichiro Abe <[email protected]> wrote:
>>>> Hi Furkan,
>>>>
>>>> Please see the previous mail [1], which may be the same issue.
>>>> As far as I know, the newline chars will appear with any Tika version, and you can see them in the JSON output in Solr. If you want to remove them, please use a charfilter or an update processor in Solr. I think searching works even when fields contain newline chars, so I don't think it is MCF's SolrJ issue.
>>>>
>>>> [1] http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201610.mbox/%3CCA%2BeTv_UO5DKgza%2Bo0bVQF_i%2B8wtHdz61gP51XHu2gF3rKLn%2BMg%40mail.gmail.com%3E
>>>>
>>>> Shinichiro Abe
>>>>
>>>> 2016-11-26 4:11 GMT+09:00 Karl Wright <[email protected]>:
>>>>> I am on vacation today and have other responsibilities. However, I believe Shinichiro Abe might be able to test this out. He redid the Solr integration for SolrJ 6.3.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>> On Fri, Nov 25, 2016 at 1:54 PM, Furkan KAMACI <[email protected]> wrote:
>>>>>> Hi Karl,
>>>>>>
>>>>>> Could you try to test MFC with Solr? I cannot see the content field with either Windows Shares or File System on Solr 4.x, 5.x, or 6.x. Only Solr 4.x has content, and it is as I defined. The code that sends the content as a stream may have some problems.
>>>>>>
>>>>>> Kind Regards,
>>>>>> Furkan KAMACI
>>>>>>
>>>>>> On Fri, Nov 25, 2016 at 4:13 PM, Furkan KAMACI <[email protected]> wrote:
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> By the way, I've tried different versions of Solr and either couldn't get content at all or got it as I've explained. When I check out the MFC trunk, which uses Solr 6.3.0, and use Solr 6.3.0 as the output connector, I can see that documents are indexed, but I cannot even see a "content" field.
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>> Furkan KAMACI
>>>>>>>
>>>>>>> On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <[email protected]> wrote:
>>>>>>>> Hi Furkan,
>>>>>>>>
>>>>>>>> The following code is used to set up a SolrJ object that is then later converted to a post request:
>>>>>>>>
>>>>>>>> >>>>>>
>>>>>>>> private void buildExtractUpdateHandlerRequest( long length, InputStream is,
>>>>>>>>   String contentType, String contentName,
>>>>>>>>   ContentStreamUpdateRequest contentStreamUpdateRequest )
>>>>>>>>   throws IOException
>>>>>>>> {
>>>>>>>>   ModifiableSolrParams out = new ModifiableSolrParams();
>>>>>>>>
>>>>>>>>   // Write the id field
>>>>>>>>   writeField(out,LITERAL+idAttributeName,documentURI);
>>>>>>>>   // Write the rest of the attributes
>>>>>>>>   if (originalSizeAttributeName != null)
>>>>>>>>   {
>>>>>>>>     Long size = document.getOriginalSize();
>>>>>>>>     if (size != null)
>>>>>>>>       // Write value
>>>>>>>>       writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>>>>>>>>   }
>>>>>>>>   if (modifiedDateAttributeName != null)
>>>>>>>>   {
>>>>>>>>     Date date = document.getModifiedDate();
>>>>>>>>     if (date != null)
>>>>>>>>       // Write value
>>>>>>>>       writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>>>>>>>>   }
>>>>>>>>   if (createdDateAttributeName != null)
>>>>>>>>   {
>>>>>>>>     Date date = document.getCreatedDate();
>>>>>>>>     if (date != null)
>>>>>>>>       // Write value
>>>>>>>>       writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>>>>>>>>   }
>>>>>>>>   if (indexedDateAttributeName != null)
>>>>>>>>   {
>>>>>>>>     Date date = document.getIndexingDate();
>>>>>>>>     if (date != null)
>>>>>>>>       // Write value
>>>>>>>>       writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>>>>>>>>   }
>>>>>>>>   if (fileNameAttributeName != null)
>>>>>>>>   {
>>>>>>>>     String fileName = document.getFileName();
>>>>>>>>     if (!StringUtils.isBlank(fileName))
>>>>>>>>       writeField(out,LITERAL+fileNameAttributeName,fileName);
>>>>>>>>   }
>>>>>>>>   if (mimeTypeAttributeName != null)
>>>>>>>>   {
>>>>>>>>     String mimeType = document.getMimeType();
>>>>>>>>     if (!StringUtils.isBlank(mimeType))
>>>>>>>>       writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Write the access token information
>>>>>>>>   // Both maps have the same keys.
>>>>>>>>   Iterator<String> typeIterator = aclsMap.keySet().iterator();
>>>>>>>>   while (typeIterator.hasNext())
>>>>>>>>   {
>>>>>>>>     String aclType = typeIterator.next();
>>>>>>>>     writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Write the arguments
>>>>>>>>   for (String name : arguments.keySet())
>>>>>>>>   {
>>>>>>>>     List<String> values = arguments.get(name);
>>>>>>>>     writeField(out,name,values);
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Write the metadata, each in a field by itself
>>>>>>>>   buildSolrParamsFromMetadata(out);
>>>>>>>>
>>>>>>>>   // These are unnecessary now in the case of non-solrcloud setups,
>>>>>>>>   // because we overrode the SolrJ posting method to use multipart.
>>>>>>>>   //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>>>>>>>>   //writeField(out,LITERAL+"stream_name",document.getFileName());
>>>>>>>>
>>>>>>>>   // General hint for Tika
>>>>>>>>   if (!StringUtils.isBlank(document.getFileName()))
>>>>>>>>     writeField(out,"resource.name",document.getFileName());
>>>>>>>>
>>>>>>>>   // Write the commitWithin parameter
>>>>>>>>   if (commitWithin != null)
>>>>>>>>     writeField(out,COMMITWITHIN_METADATA,commitWithin);
>>>>>>>>
>>>>>>>>   contentStreamUpdateRequest.setParams(out);
>>>>>>>>
>>>>>>>>   contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length,contentType,contentName));
>>>>>>>> }
>>>>>>>> <<<<<<
>>>>>>>>
>>>>>>>> The ContentStreamUpdateRequest object is defined within SolrJ. Normally this would be the end of ManifoldCF involvement, but we have also needed to override some SolrJ classes because of bugs. So it is possible that we could fix this behavior if the problem is within the code we have changed. However, having said that, I am not sure that the differences you report are significant in any way. The W3C spec for multipart HTTP requests is what you'd want to look at for that.
>>>>>>>>
>>>>>>>> Please see ModifiedHttpMultipart.java for more details.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI <[email protected]> wrote:
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> I used the default values for Solr. In my Solr output connector, "Use the Extract Update Handler" is checked, and the update handler is defined as "/update/extract". There is no Tika content extractor defined in the job pipeline.
>>>>>>>>>
>>>>>>>>> I have Wireshark captures and logs from both ManifoldCF and Solr. I can share them if you want.
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>> Furkan KAMACI
>>>>>>>>>
>>>>>>>>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]> wrote:
>>>>>>>>>> Is this being indexed via the extracting update handler? What does your pipeline look like? Is the Tika extractor in the pipeline?
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <[email protected]> wrote:
>>>>>>>>>>> I've indexed a file via ManifoldCF into Solr whose content starts with:
>>>>>>>>>>>
>>>>>>>>>>> 1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" directed by Elia Kazan, 1951
>>>>>>>>>>>
>>>>>>>>>>> 2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by Elia Kazan, 1951
>>>>>>>>>>>
>>>>>>>>>>> 3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by Elia Kazan, 1951
>>>>>>>>>>>
>>>>>>>>>>> However, when I check Solr, I see this in the content field:
>>>>>>>>>>>
>>>>>>>>>>> " \n \nstream_source_info MARLON BRANDO.rtf \nstream_content_type application/rtf \nstream_size 13580 \nstream_name MARLON BRANDO.rtf \nContent-Type application/rtf \nresourceName MARLON BRANDO.rtf \n \n \n 1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\" directed by Elia Kazan \n"
>>>>>>>>>>>
>>>>>>>>>>> There are two problems here:
>>>>>>>>>>>
>>>>>>>>>>> 1) There are newline characters that are unnecessary.
>>>>>>>>>>>
>>>>>>>>>>> 2) There is metadata prepended to the content field that should not be there.
>>>>>>>>>>>
>>>>>>>>>>> So one might think the problem is in Solr or in ManifoldCF (related to Tika). When I index the same document into Solr via cURL, there are no newline characters and no prepended metadata.
>>>>>>>>>>>
>>>>>>>>>>> What do you think about a solution?
>>>>>>>>>>>
>>>>>>>>>>> Kind Regards,
>>>>>>>>>>> Furkan KAMACI
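
Coming back to the leftover newline characters Shinichiro mentions above: the update-processor route changes the stored value itself, whereas a charfilter only affects the analyzed tokens, so the \n would still show up in search results. A rough sketch for solrconfig.xml, assuming the extracted text lands in content_t as in the examples above (the chain name is only illustrative):

  <updateRequestProcessorChain name="strip-newlines">
    <!-- Collapse runs of whitespace, including newlines, in the extracted content. -->
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">content_t</str>
      <str name="pattern">\s+</str>
      <str name="replacement"> </str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

The chain then has to be selected for the extract handler, e.g. by adding <str name="update.chain">strip-newlines</str> to the /update/extract defaults shown above.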
