Re: Unnecessary Newline Characters and Metadata at Content

Shinichiro Abe Sat, 26 Nov 2016 07:24:06 -0800

Hi,

If you use the following parameters, you can remove the metadata
from the content field value.


$ curl 'http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true&fmap.content=content_t&fmap.uprefix=ignored_&captureAttr=true&fmap.div=ignored_&fmap.a=ignored_' \
    -F "myfile=@/path/to/file.pdf"

Those parameters are in the Solr 4 config, but they have been removed since Solr 5.
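
For anyone who wants to send the same request from code rather than curl,
here is a minimal sketch that assembles the URL with one parameter per line.
The class and method names (ExtractUrlSketch, buildExtractUrl, extractParams)
are hypothetical and are not part of Solr or ManifoldCF. The parameter names
and values are copied from the curl command above, and the comments reflect my
understanding of the Solr Cell parameters, so please check them against the
documentation for your Solr version.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class ExtractUrlSketch {

    // URL-encodes one query-string component as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: appends the parameters to the handler URL
    // in insertion order, separated by '?' and then '&'.
    static String buildExtractUrl(String handlerUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(handlerUrl);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            url.append(separator)
               .append(encode(entry.getKey()))
               .append('=')
               .append(encode(entry.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // The parameters exactly as they appear in the curl command.
    static Map<String, String> extractParams(String docId) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", docId);          // literal value for the id field
        params.put("commit", "true");             // commit as part of this request
        params.put("fmap.content", "content_t");  // map the extracted "content" field to content_t
        params.put("fmap.uprefix", "ignored_");   // copied as written in the command
        params.put("captureAttr", "true");        // capture XHTML element attributes into separate fields
        params.put("fmap.div", "ignored_");       // map the "div" field to ignored_
        params.put("fmap.a", "ignored_");         // map the "a" field to ignored_
        return params;
    }

    public static void main(String[] args) {
        String url = buildExtractUrl(
            "http://localhost:8983/solr/collection1/update/extract",
            extractParams("doc1"));
        System.out.println(url);
    }
}
```

The sketch only builds the URL; the file itself would still have to be posted
as a multipart body, as curl does with -F.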

Regards,
Shinichiro Abe

2016-11-26 23:55 GMT+09:00 Shinichiro Abe <[email protected]>:
> Hi,
>
>> Everything is OK when you directly send data to Solr without MFC.
> How did you send files?
>
> I just sent a PDF to Solr with curl, and the metadata is included in the content
> field value.
>
> command:
> $ curl 'http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true&fmap.content=content_t' \
>     -F "myfile=@/path/to/file.pdf"
>
> content field value with metadata:
> "content_t":[" \n \n date 2016-06-27T13:15:05Z  \n pdf:PDFVersion 1.4 ...
>
> The content field value indexed by Solr Cell contains the metadata
> strings unless field mapping is configured in solrconfig.xml.
>
> Shinichiro Abe
>
>
>
> 2016-11-26 21:26 GMT+09:00 Furkan KAMACI <[email protected]>:
>> Hi Shinichiro,
>>
>> Yes, I can see the content that way. However, besides the newline
>> characters, there is metadata information prepended
>> to the content. Everything is OK when you send data directly to Solr without
>> MFC.
>>
>> For example, one of my content values starts with this:
>>
>> \n \n stream_size 298979  \n pdf:PDFVersion 1.4  \n X-Parsed-By
>> org.apache.tika.parser.DefaultParser  \n X-Parsed-By
>> org.apache.tika.parser.pdf.PDFParser  \n xmp:CreatorTool Google  \n
>> stream_content_type application/pdf  \n access_permission:modify_annotations
>> true  \n access_permission:can_print_degraded true
>>
>> I am suspicious of the way that MFC sends data to Solr. Could you
>> also check it?
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> On Sat, Nov 26, 2016 at 2:52 AM, Shinichiro Abe <[email protected]>
>> wrote:
>>>
>>> Hi Furkan,
>>>
>>> Please see the previous mail [1], which may be the same issue.
>>> As far as I know, the newline characters will appear with any Tika
>>> version, and you can see them in Solr's JSON output. If you want to
>>> remove them, please use a char filter or an update processor in Solr. I think
>>> searching works even when fields contain newline characters, so I don't
>>> think it is an MCF SolrJ issue.
>>>
>>>
>>> [1]http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201610.mbox/%3CCA%2BeTv_UO5DKgza%2Bo0bVQF_i%2B8wtHdz61gP51XHu2gF3rKLn%2BMg%40mail.gmail.com%3E
>>>
>>> Shinichiro Abe
>>>
>>> 2016-11-26 4:11 GMT+09:00 Karl Wright <[email protected]>:
>>> > I am on vacation today and have other responsibilities.  However, I believe
>>> > Shinichiro Abe might be able to test this out.  He redid the Solr
>>> > integration for SolrJ 6.3.
>>> >
>>> > Thanks,
>>> > Karl
>>> >
>>> >
>>> > On Fri, Nov 25, 2016 at 1:54 PM, Furkan KAMACI <[email protected]>
>>> > wrote:
>>> >>
>>> >> Hi Karl,
>>> >>
>>> >> Could you try to test MFC with Solr? I cannot see the content field with either
>>> >> Windows Shares or File System on Solr 4.x, 5.x, or 6.x. Only Solr 4.x
>>> >> has content, and it is as I described. The code that sends the content as a
>>> >> stream may have some problems.
>>> >>
>>> >> Kind Regards,
>>> >> Furkan KAMACI
>>> >>
>>> >>
>>> >> On Fri, Nov 25, 2016 at 4:13 PM, Furkan KAMACI <[email protected]>
>>> >> wrote:
>>> >>>
>>> >>> Hi Karl,
>>> >>>
>>> >>> By the way, I've tried different versions of Solr and either couldn't get the
>>> >>> content or got it as I've explained. When I check out the MFC trunk, which uses
>>> >>> Solr 6.3.0, and use Solr 6.3.0 as the output connector, I can see that documents
>>> >>> are indexed, but I cannot even see the "content" field.
>>> >>>
>>> >>> Kind Regards,
>>> >>> Furkan KAMACI
>>> >>>
>>> >>> On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <[email protected]>
>>> >>> wrote:
>>> >>>>
>>> >>>> Hi Furkan,
>>> >>>>
>>> >>>> The following code is used to set up a SolrJ object that is then later
>>> >>>> converted to a post request:
>>> >>>>
>>> >>>> >>>>>>
>>> >>>>     private void buildExtractUpdateHandlerRequest( long length, InputStream is, String contentType,
>>> >>>>       String contentName,
>>> >>>>       ContentStreamUpdateRequest contentStreamUpdateRequest )
>>> >>>>       throws IOException
>>> >>>>     {
>>> >>>>       ModifiableSolrParams out = new ModifiableSolrParams();
>>> >>>>
>>> >>>>       // Write the id field
>>> >>>>       writeField(out,LITERAL+idAttributeName,documentURI);
>>> >>>>       // Write the rest of the attributes
>>> >>>>       if (originalSizeAttributeName != null)
>>> >>>>       {
>>> >>>>         Long size = document.getOriginalSize();
>>> >>>>         if (size != null)
>>> >>>>           // Write value
>>> >>>>           writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>>> >>>>       }
>>> >>>>       if (modifiedDateAttributeName != null)
>>> >>>>       {
>>> >>>>         Date date = document.getModifiedDate();
>>> >>>>         if (date != null)
>>> >>>>           // Write value
>>> >>>>           writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>>> >>>>       }
>>> >>>>       if (createdDateAttributeName != null)
>>> >>>>       {
>>> >>>>         Date date = document.getCreatedDate();
>>> >>>>         if (date != null)
>>> >>>>           // Write value
>>> >>>>           writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>>> >>>>       }
>>> >>>>       if (indexedDateAttributeName != null)
>>> >>>>       {
>>> >>>>         Date date = document.getIndexingDate();
>>> >>>>         if (date != null)
>>> >>>>           // Write value
>>> >>>>           writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>>> >>>>       }
>>> >>>>       if (fileNameAttributeName != null)
>>> >>>>       {
>>> >>>>         String fileName = document.getFileName();
>>> >>>>         if (!StringUtils.isBlank(fileName))
>>> >>>>           writeField(out,LITERAL+fileNameAttributeName,fileName);
>>> >>>>       }
>>> >>>>       if (mimeTypeAttributeName != null)
>>> >>>>       {
>>> >>>>         String mimeType = document.getMimeType();
>>> >>>>         if (!StringUtils.isBlank(mimeType))
>>> >>>>           writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>>> >>>>       }
>>> >>>>
>>> >>>>       // Write the access token information
>>> >>>>       // Both maps have the same keys.
>>> >>>>       Iterator<String> typeIterator = aclsMap.keySet().iterator();
>>> >>>>       while (typeIterator.hasNext())
>>> >>>>       {
>>> >>>>         String aclType = typeIterator.next();
>>> >>>>         writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>>> >>>>       }
>>> >>>>
>>> >>>>       // Write the arguments
>>> >>>>       for (String name : arguments.keySet())
>>> >>>>       {
>>> >>>>         List<String> values = arguments.get(name);
>>> >>>>         writeField(out,name,values);
>>> >>>>       }
>>> >>>>
>>> >>>>       // Write the metadata, each in a field by itself
>>> >>>>       buildSolrParamsFromMetadata(out);
>>> >>>>
>>> >>>>       // These are unnecessary now in the case of non-solrcloud setups,
>>> >>>>       // because we overrode the SolrJ posting method to use multipart.
>>> >>>>       //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>>> >>>>       //writeField(out,LITERAL+"stream_name",document.getFileName());
>>> >>>>
>>> >>>>       // General hint for Tika
>>> >>>>       if (!StringUtils.isBlank(document.getFileName()))
>>> >>>>         writeField(out,"resource.name",document.getFileName());
>>> >>>>
>>> >>>>       // Write the commitWithin parameter
>>> >>>>       if (commitWithin != null)
>>> >>>>         writeField(out,COMMITWITHIN_METADATA,commitWithin);
>>> >>>>
>>> >>>>       contentStreamUpdateRequest.setParams(out);
>>> >>>>
>>> >>>>       contentStreamUpdateRequest.addContentStream(
>>> >>>>         new RepositoryDocumentStream(is,length,contentType,contentName));
>>> >>>>     }
>>> >>>> <<<<<<
>>> >>>>
>>> >>>> The ContentStreamUpdateRequest object is defined within SolrJ.  Normally
>>> >>>> this would be the end of ManifoldCF involvement, but we have also needed to
>>> >>>> override some SolrJ classes because of bugs.  So it is possible that we
>>> >>>> could fix this behavior if the problem is within the code we have changed.
>>> >>>> However, having said that, I am not sure that the differences you report are
>>> >>>> significant in any way. The w3c spec for multipart HTTP requests is what
>>> >>>> you'd want to look at for that.
>>> >>>>
>>> >>>> Please see ModifiedHttpMultipart.java for more details.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Karl
>>> >>>>
>>> >>>>
>>> >>>> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI
>>> >>>> <[email protected]>
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> Hi Karl,
>>> >>>>>
>>> >>>>> I used default values for Solr. In my Solr output connector, "Use the
>>> >>>>> Extract Update Handler" is checked. The update handler is defined as
>>> >>>>> "/update/extract". There is no Tika content extractor defined in the job
>>> >>>>> pipeline.
>>> >>>>>
>>> >>>>> I have Wireshark captures and logs from both ManifoldCF and Solr. I can
>>> >>>>> share them if you want.
>>> >>>>>
>>> >>>>> Kind Regards,
>>> >>>>> Furkan KAMACI
>>> >>>>>
>>> >>>>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]>
>>> >>>>> wrote:
>>> >>>>>>
>>> >>>>>> Is this being indexed via the extracting update handler?  What does
>>> >>>>>> your pipeline look like?  Is the tika extractor in the pipeline?
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Karl
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI
>>> >>>>>> <[email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>> I've indexed a file via ManifoldCF to Solr whose content starts
>>> >>>>>>> with:
>>> >>>>>>>
>>> >>>>>>> 1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
>>> >>>>>>> directed by Elia Kazan, 1951
>>> >>>>>>>
>>> >>>>>>> 2. Portrait of Marlon Brando for "A Streetcar Named Desire"
>>> >>>>>>> directed by Elia Kazan, 1951
>>> >>>>>>>
>>> >>>>>>> 3. Portrait of Marlon Brando for "A Streetcar Named Desire"
>>> >>>>>>> directed by Elia Kazan, 1951
>>> >>>>>>>
>>> >>>>>>> However, when I check Solr, I see this in the content field:
>>> >>>>>>>
>>> >>>>>>>  " \n \nstream_source_info MARLON BRANDO.rtf
>>> >>>>>>> \nstream_content_type
>>> >>>>>>> application/rtf   \nstream_size 13580   \nstream_name MARLON
>>> >>>>>>> BRANDO.rtf
>>> >>>>>>> \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf
>>> >>>>>>> \n  \n
>>> >>>>>>> \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named
>>> >>>>>>> Desire\"
>>> >>>>>>> directed by Elia Kazan \n"
>>> >>>>>>>
>>> >>>>>>> There are 2 problems here.
>>> >>>>>>>
>>> >>>>>>> 1) There are unnecessary newline characters.
>>> >>>>>>>
>>> >>>>>>> 2) There is metadata prepended to the content field, which should not
>>> >>>>>>> be there.
>>> >>>>>>>
>>> >>>>>>> So, the problem may be in Solr or in ManifoldCF (related
>>> >>>>>>> to Tika). When I index the same document to Solr via cURL, there are
>>> >>>>>>> no newline characters and no metadata prepended.
>>> >>>>>>>
>>> >>>>>>> What do you think would be a solution?
>>> >>>>>>>
>>> >>>>>>> Kind Regards,
>>> >>>>>>> Furkan KAMACI
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>
>>
