Re: Unnecessary Newline Characters and Metadata at Content

Shinichiro Abe Sat, 26 Nov 2016 06:57:01 -0800

Hi,

> Everything is OK when you directly send data to Solr without MFC.
How did you send files?


I just sent a PDF to Solr with curl, and the metadata is included in the
content field value.

command:
$ curl 'http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true&fmap.content=content_t' \
    -F "myfile=@/path/to/file.pdf"

content field value with metadata:
"content_t":[" \n \n date 2016-06-27T13:15:05Z  \n pdf:PDFVersion 1.4 ...

The content field value indexed by Solr Cell contains the metadata
strings unless field mapping is configured in solrconfig.xml.
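
On the newline characters discussed further down in this thread: the
usual place to remove them is a char filter or update processor inside
Solr. Purely as an illustration of the effect, and not as ManifoldCF or
Solr code, here is a minimal plain-Java sketch that collapses whitespace
runs in an extracted content value. The class and method names
(ContentCleanup, collapseWhitespace) are hypothetical, and the sketch
does not remove the prepended metadata strings.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of ManifoldCF or Solr.
final class ContentCleanup {
    // Matches any run of one or more whitespace characters
    // (spaces, tabs, \n, \r).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replaces every whitespace run with a single space and trims the ends.
    // A null input is treated as empty content.
    static String collapseWhitespace(String content) {
        if (content == null) {
            return "";
        }
        return WHITESPACE_RUN.matcher(content).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Sample value shaped like the content field quoted in this thread.
        String raw = " \n \n  1. Vivien Leigh and Marlon Brando in "
            + "\"A Streetcar Named Desire\" directed by Elia Kazan \n";
        System.out.println(collapseWhitespace(raw));
    }
}
```

Doing the same replacement at index time inside Solr keeps the stored
value and the searchable text consistent, which a client-side cleanup
like this one cannot guarantee.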

Shinichiro Abe



2016-11-26 21:26 GMT+09:00 Furkan KAMACI <[email protected]>:
> Hi Shinichiro,
>
> Yes, I can see the content that way. However, besides the newline
> characters, there is metadata prepended to the content.
> Everything is OK when you directly send data to Solr without MFC.
>
> For example, the content of one of my documents starts with:
>
> \n \n stream_size 298979  \n pdf:PDFVersion 1.4  \n X-Parsed-By
> org.apache.tika.parser.DefaultParser  \n X-Parsed-By
> org.apache.tika.parser.pdf.PDFParser  \n xmp:CreatorTool Google  \n
> stream_content_type application/pdf  \n access_permission:modify_annotations
> true  \n access_permission:can_print_degraded true
>
> I am suspicious of the way MFC sends data to Solr. Could you
> also check it?
>
> Kind Regards,
> Furkan KAMACI
>
> On Sat, Nov 26, 2016 at 2:52 AM, Shinichiro Abe <[email protected]>
> wrote:
>>
>> Hi Furkan,
>>
>> Please see the previous mail [1], which may be about the same issue.
>> As far as I know, the newline characters appear with any Tika
>> version, and you can see them in Solr's JSON response format. If you
>> want to remove them, please use a char filter or an update processor
>> in Solr. I think searching works even when fields contain newline
>> characters, so I don't think this is an issue in MCF's SolrJ code.
>>
>>
>> [1]http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201610.mbox/%3CCA%2BeTv_UO5DKgza%2Bo0bVQF_i%2B8wtHdz61gP51XHu2gF3rKLn%2BMg%40mail.gmail.com%3E
>>
>> Shinichiro Abe
>>
>> 2016-11-26 4:11 GMT+09:00 Karl Wright <[email protected]>:
>> > I am on vacation today and have other responsibilities.  However, I
>> > believe
>> > Shinichiro Abe might be able to test this out.  He redid the Solr
>> > integration for SolrJ 6.3.
>> >
>> > Thanks,
>> > Karl
>> >
>> >
>> > On Fri, Nov 25, 2016 at 1:54 PM, Furkan KAMACI <[email protected]>
>> > wrote:
>> >>
>> >> Hi Karl,
>> >>
>> >> Could you try to test MFC with Solr? I cannot see the content field with
>> >> either Windows Shares or File System on Solr 4.x, 5.x, or 6.x. Only Solr
>> >> 4.x has content, and it is as I defined. The code that sends the
>> >> content as a stream may have a problem.
>> >>
>> >> Kind Regards,
>> >> Furkan KAMACI
>> >>
>> >>
>> >> On Fri, Nov 25, 2016 at 4:13 PM, Furkan KAMACI <[email protected]>
>> >> wrote:
>> >>>
>> >>> Hi Karl,
>> >>>
>> >>> By the way, I've tried different versions of Solr and either couldn't
>> >>> get the content or got it as I've explained. When I check out the MFC
>> >>> trunk, which uses Solr 6.3.0, and use Solr 6.3.0 as the output
>> >>> connector, I can see that documents are indexed, but I cannot even see
>> >>> the "content" field.
>> >>>
>> >>> Kind Regards,
>> >>> Furkan KAMACI
>> >>>
>> >>> On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> Hi Furkan,
>> >>>>
>> >>>> The following code is used to set up a SolrJ object that is then
>> >>>> later
>> >>>> converted to a post request:
>> >>>>
>> >>>> >>>>>>
>> >>>>     private void buildExtractUpdateHandlerRequest( long length, InputStream is, String contentType,
>> >>>>       String contentName,
>> >>>>       ContentStreamUpdateRequest contentStreamUpdateRequest )
>> >>>>       throws IOException
>> >>>>     {
>> >>>>       ModifiableSolrParams out = new ModifiableSolrParams();
>> >>>>
>> >>>>       // Write the id field
>> >>>>       writeField(out,LITERAL+idAttributeName,documentURI);
>> >>>>       // Write the rest of the attributes
>> >>>>       if (originalSizeAttributeName != null)
>> >>>>       {
>> >>>>         Long size = document.getOriginalSize();
>> >>>>         if (size != null)
>> >>>>           // Write value
>> >>>>           writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>> >>>>       }
>> >>>>       if (modifiedDateAttributeName != null)
>> >>>>       {
>> >>>>         Date date = document.getModifiedDate();
>> >>>>         if (date != null)
>> >>>>           // Write value
>> >>>>           writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>> >>>>       }
>> >>>>       if (createdDateAttributeName != null)
>> >>>>       {
>> >>>>         Date date = document.getCreatedDate();
>> >>>>         if (date != null)
>> >>>>           // Write value
>> >>>>           writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>> >>>>       }
>> >>>>       if (indexedDateAttributeName != null)
>> >>>>       {
>> >>>>         Date date = document.getIndexingDate();
>> >>>>         if (date != null)
>> >>>>           // Write value
>> >>>>           writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>> >>>>       }
>> >>>>       if (fileNameAttributeName != null)
>> >>>>       {
>> >>>>         String fileName = document.getFileName();
>> >>>>         if (!StringUtils.isBlank(fileName))
>> >>>>           writeField(out,LITERAL+fileNameAttributeName,fileName);
>> >>>>       }
>> >>>>       if (mimeTypeAttributeName != null)
>> >>>>       {
>> >>>>         String mimeType = document.getMimeType();
>> >>>>         if (!StringUtils.isBlank(mimeType))
>> >>>>           writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>> >>>>       }
>> >>>>
>> >>>>       // Write the access token information
>> >>>>       // Both maps have the same keys.
>> >>>>       Iterator<String> typeIterator = aclsMap.keySet().iterator();
>> >>>>       while (typeIterator.hasNext())
>> >>>>       {
>> >>>>         String aclType = typeIterator.next();
>> >>>>         writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>> >>>>       }
>> >>>>
>> >>>>       // Write the arguments
>> >>>>       for (String name : arguments.keySet())
>> >>>>       {
>> >>>>         List<String> values = arguments.get(name);
>> >>>>         writeField(out,name,values);
>> >>>>       }
>> >>>>
>> >>>>       // Write the metadata, each in a field by itself
>> >>>>       buildSolrParamsFromMetadata(out);
>> >>>>
>> >>>>       // These are unnecessary now in the case of non-solrcloud setups,
>> >>>>       // because we overrode the SolrJ posting method to use multipart.
>> >>>>       //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>> >>>>       //writeField(out,LITERAL+"stream_name",document.getFileName());
>> >>>>
>> >>>>       // General hint for Tika
>> >>>>       if (!StringUtils.isBlank(document.getFileName()))
>> >>>>         writeField(out,"resource.name",document.getFileName());
>> >>>>
>> >>>>       // Write the commitWithin parameter
>> >>>>       if (commitWithin != null)
>> >>>>         writeField(out,COMMITWITHIN_METADATA,commitWithin);
>> >>>>
>> >>>>       contentStreamUpdateRequest.setParams(out);
>> >>>>
>> >>>>       contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length,contentType,contentName));
>> >>>>     }
>> >>>> <<<<<<
>> >>>>
>> >>>> The ContentStreamUpdateRequest object is defined within SolrJ. Normally
>> >>>> this would be the end of ManifoldCF involvement, but we have also needed
>> >>>> to override some SolrJ classes because of bugs.  So it is possible that
>> >>>> we could fix this behavior if the problem is within the code we have
>> >>>> changed. However, having said that, I am not sure that the differences
>> >>>> you report are significant in any way. The w3c spec for multipart HTTP
>> >>>> requests is what you'd want to look at for that.
>> >>>>
>> >>>> Please see ModifiedHttpMultipart.java for more details.
>> >>>>
>> >>>> Thanks,
>> >>>> Karl
>> >>>>
>> >>>>
>> >>>> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI
>> >>>> <[email protected]>
>> >>>> wrote:
>> >>>>>
>> >>>>> Hi Karl,
>> >>>>>
>> >>>>> I used default values for Solr. In my Solr output connector, "Use the
>> >>>>> Extract Update Handler" is checked. The update handler is defined as
>> >>>>> "/update/extract". There is no Tika content extractor defined in the
>> >>>>> job pipeline.
>> >>>>>
>> >>>>> I have WireShark captures and logs from both ManifoldCF and Solr. I can
>> >>>>> share them if you want.
>> >>>>>
>> >>>>> Kind Regards,
>> >>>>> Furkan KAMACI
>> >>>>>
>> >>>>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> Is this being indexed via the extracting update handler?  What does
>> >>>>>> your pipeline look like?  Is the tika extractor in the pipeline?
>> >>>>>>
>> >>>>>>
>> >>>>>> Karl
>> >>>>>>
>> >>>>>>
>> >>>>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI
>> >>>>>> <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>> I've indexed a file to Solr via ManifoldCF whose content starts
>> >>>>>>> with:
>> >>>>>>>
>> >>>>>>> 1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
>> >>>>>>> directed by Elia Kazan, 1951
>> >>>>>>>
>> >>>>>>> 2. Portrait of Marlon Brando for "A Streetcar Named Desire"
>> >>>>>>> directed
>> >>>>>>> by Elia Kazan, 1951
>> >>>>>>>
>> >>>>>>> 3. Portrait of Marlon Brando for "A Streetcar Named Desire"
>> >>>>>>> directed
>> >>>>>>> by Elia Kazan, 1951
>> >>>>>>>
>> >>>>>>> However, when I check Solr, I see this in the content field:
>> >>>>>>>
>> >>>>>>>  " \n \nstream_source_info MARLON BRANDO.rtf
>> >>>>>>> \nstream_content_type
>> >>>>>>> application/rtf   \nstream_size 13580   \nstream_name MARLON
>> >>>>>>> BRANDO.rtf
>> >>>>>>> \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf
>> >>>>>>> \n  \n
>> >>>>>>> \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named
>> >>>>>>> Desire\"
>> >>>>>>> directed by Elia Kazan \n"
>> >>>>>>>
>> >>>>>>> There are two problems here:
>> >>>>>>>
>> >>>>>>> 1) There are unnecessary newline characters.
>> >>>>>>>
>> >>>>>>> 2) Metadata is prepended to the content field, which should not
>> >>>>>>> happen.
>> >>>>>>>
>> >>>>>>> So the problem may be in Solr or in ManifoldCF (related to Tika).
>> >>>>>>> When I index the same document to Solr via cURL, no newline
>> >>>>>>> characters or metadata are prepended.
>> >>>>>>>
>> >>>>>>> What do you think would be a solution?
>> >>>>>>>
>> >>>>>>> Kind Regards,
>> >>>>>>> Furkan KAMACI
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>
>
