Re: Unnecessary Newline Characters and Metadata at Content

Furkan KAMACI Fri, 25 Nov 2016 10:54:35 -0800

Hi Karl,

Could you try testing ManifoldCF with Solr? I cannot see the content field with
either Windows Shares or File System on Solr 4.x, 5.x, or 6.x. Only Solr 4.x has
content, and it is as I described earlier. The code that sends the content as a
stream may have a problem.


Kind Regards,
Furkan KAMACI


On Fri, Nov 25, 2016 at 4:13 PM, Furkan KAMACI <[email protected]>
wrote:

> Hi Karl,
>
> By the way, I've tried different versions of Solr, and either I couldn't get
> the content at all or I got it as I've explained. When I check out the
> ManifoldCF trunk, which uses Solr 6.3.0, and use Solr 6.3.0 as the output
> connector, I can see that documents are indexed, but I cannot even see the
> "content" field.
>
> Kind Regards,
> Furkan KAMACI
>
> On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Furkan,
>>
>> The following code is used to set up a SolrJ object that is then later
>> converted to a post request:
>>
>> >>>>>>
>>     private void buildExtractUpdateHandlerRequest( long length, InputStream is, String contentType,
>>       String contentName,
>>       ContentStreamUpdateRequest contentStreamUpdateRequest )
>>       throws IOException
>>     {
>>       ModifiableSolrParams out = new ModifiableSolrParams();
>>
>>       // Write the id field
>>       writeField(out,LITERAL+idAttributeName,documentURI);
>>       // Write the rest of the attributes
>>       if (originalSizeAttributeName != null)
>>       {
>>         Long size = document.getOriginalSize();
>>         if (size != null)
>>           // Write value
>>           writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>>       }
>>       if (modifiedDateAttributeName != null)
>>       {
>>         Date date = document.getModifiedDate();
>>         if (date != null)
>>           // Write value
>>           writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>>       }
>>       if (createdDateAttributeName != null)
>>       {
>>         Date date = document.getCreatedDate();
>>         if (date != null)
>>           // Write value
>>           writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>>       }
>>       if (indexedDateAttributeName != null)
>>       {
>>         Date date = document.getIndexingDate();
>>         if (date != null)
>>           // Write value
>>           writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>>       }
>>       if (fileNameAttributeName != null)
>>       {
>>         String fileName = document.getFileName();
>>         if (!StringUtils.isBlank(fileName))
>>           writeField(out,LITERAL+fileNameAttributeName,fileName);
>>       }
>>       if (mimeTypeAttributeName != null)
>>       {
>>         String mimeType = document.getMimeType();
>>         if (!StringUtils.isBlank(mimeType))
>>           writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>>       }
>>
>>       // Write the access token information
>>       // Both maps have the same keys.
>>       Iterator<String> typeIterator = aclsMap.keySet().iterator();
>>       while (typeIterator.hasNext())
>>       {
>>         String aclType = typeIterator.next();
>>         writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>>       }
>>
>>       // Write the arguments
>>       for (String name : arguments.keySet())
>>       {
>>         List<String> values = arguments.get(name);
>>         writeField(out,name,values);
>>       }
>>
>>       // Write the metadata, each in a field by itself
>>       buildSolrParamsFromMetadata(out);
>>
>>       // These are unnecessary now in the case of non-solrcloud setups,
>>       // because we overrode the SolrJ posting method to use multipart.
>>       //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>>       //writeField(out,LITERAL+"stream_name",document.getFileName());
>>
>>       // General hint for Tika
>>       if (!StringUtils.isBlank(document.getFileName()))
>>         writeField(out,"resource.name",document.getFileName());
>>
>>       // Write the commitWithin parameter
>>       if (commitWithin != null)
>>         writeField(out,COMMITWITHIN_METADATA,commitWithin);
>>
>>       contentStreamUpdateRequest.setParams(out);
>>
>>       contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length,contentType,contentName));
>>     }
>> <<<<<<
>>
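For reference, the parameters collected by the method above end up as request parameters on the extract handler call. The sketch below is only an illustration using the JDK, not the ManifoldCF or SolrJ code: the class name `LiteralParamsSketch`, the helper `toQueryString`, and the sample values are hypothetical, and the value `"literal."` for `LITERAL` is an assumption based on the `literal.<fieldname>` parameter convention of Solr's extracting handler.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the ManifoldCF or SolrJ implementation.
class LiteralParamsSketch {
  // Assumption: mirrors the LITERAL constant used in the quoted code.
  static final String LITERAL = "literal.";

  // URL-encode one name or value as UTF-8.
  private static String enc(String s) {
    try {
      return URLEncoder.encode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is always available on a conforming JVM.
      throw new IllegalStateException(e);
    }
  }

  // Join the parameters into a query string, preserving insertion order.
  static String toQueryString(Map<String, String> params) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : params.entrySet()) {
      if (sb.length() > 0) {
        sb.append('&');
      }
      sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> params = new LinkedHashMap<>();
    // Hypothetical id field name and value, for illustration.
    params.put(LITERAL + "id", "doc 1");
    // General hint for Tika, as in the quoted code.
    params.put("resource.name", "MARLON BRANDO.rtf");
    System.out.println(toQueryString(params));
  }
}
```

With parameters like these, only the fields named by `literal.*` are expected to carry metadata; the document body travels separately as the content stream.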
>> The ContentStreamUpdateRequest object is defined within SolrJ.  Normally
>> this would be the end of ManifoldCF involvement, but we have also needed to
>> override some SolrJ classes because of bugs.  So it is possible that we
>> could fix this behavior if the problem is within the code we have changed.
>> However, having said that, I am not sure that the differences you report
>> are significant in any way. The w3c spec for multipart HTTP requests is
>> what you'd want to look at for that.
>>
>> Please see ModifiedHttpMultipart.java for more details.
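When comparing captures against the multipart format, it may help to have the general layout of a multipart/form-data body at hand. The sketch below builds one with plain JDK classes. It is not the code in ModifiedHttpMultipart.java; the class `MultipartSketch`, the method `buildBody`, the boundary `XYZ`, and the part name `myfile` are all hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the general multipart/form-data layout.
// Not ManifoldCF's ModifiedHttpMultipart.
class MultipartSketch {
  private static final String CRLF = "\r\n";

  // Build a body with one part per simple field, then one file part,
  // then the closing boundary.
  static String buildBody(String boundary, Map<String, String> fields,
                          String partName, String fileName,
                          String contentType, String content) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : fields.entrySet()) {
      sb.append("--").append(boundary).append(CRLF);
      sb.append("Content-Disposition: form-data; name=\"")
        .append(e.getKey()).append("\"").append(CRLF);
      sb.append(CRLF); // blank line separates part headers from part data
      sb.append(e.getValue()).append(CRLF);
    }
    sb.append("--").append(boundary).append(CRLF);
    sb.append("Content-Disposition: form-data; name=\"").append(partName)
      .append("\"; filename=\"").append(fileName).append("\"").append(CRLF);
    sb.append("Content-Type: ").append(contentType).append(CRLF);
    sb.append(CRLF);
    sb.append(content).append(CRLF);
    // The final boundary carries a trailing "--".
    sb.append("--").append(boundary).append("--").append(CRLF);
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("literal.id", "doc1"); // hypothetical field and value
    System.out.print(buildBody("XYZ", fields, "myfile", "example.rtf",
        "application/rtf", "1. Vivien Leigh and Marlon Brando"));
  }
}
```

In this layout the file name and content type are headers of the file part, and the document text begins only after the blank line that follows them.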
>>
>> Thanks,
>> Karl
>>
>>
>> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI <[email protected]>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> I used default values for Solr. In my Solr output connector, "Use the
>>> Extract Update Handler" is checked. The update handler is defined as
>>> "/update/extract". There is no Tika content extractor defined in the job
>>> pipeline.
>>>
>>> I have WireShark captures and logs from both ManifoldCF and Solr. I can
>>> share them if you want.
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]>
>>> wrote:
>>>
>>>> Is this being indexed via the extracting update handler?  What does
>>>> your pipeline look like?  Is the tika extractor in the pipeline?
>>>>
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <[email protected]>
>>>> wrote:
>>>>
>>>>> I've indexed a file to Solr via ManifoldCF. Its content starts
>>>>> with:
>>>>>
>>>>> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
>>>>> directed by Elia Kazan, 1951*
>>>>>
>>>>> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>>> by Elia Kazan, 1951*
>>>>>
>>>>> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>>> by Elia Kazan, 1951*
>>>>>
>>>>> However, when I check Solr, I see this in the content field:
>>>>>
>>>>> * " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
>>>>> application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
>>>>> \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n  \n
>>>>> \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>>>>> directed by Elia Kazan \n"*
>>>>>
>>>>> There are two problems here.
>>>>>
>>>>> 1) There are unnecessary newline characters.
>>>>>
>>>>> 2) Metadata is prepended to the content field, which should not happen.
>>>>>
>>>>> So the problem may be in Solr or in ManifoldCF (related to
>>>>> Tika). When I index the same document to Solr via cURL, no newline
>>>>> characters or metadata are prepended.
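The two symptoms listed above can be checked on a stored content value with a few lines of code, which is handy when comparing Solr versions or indexing paths. The sketch below is a diagnostic aid only, not a fix. The class `ContentSymptoms` and its methods are hypothetical names, the key list is taken from the content value quoted above, and the sample string in `main` is an approximation of that value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Diagnostic sketch: detects the two symptoms in a stored content value.
class ContentSymptoms {
  // Metadata names seen in the content value quoted in this thread.
  static final List<String> KNOWN_KEYS = Arrays.asList(
      "stream_source_info", "stream_content_type", "stream_size",
      "stream_name", "Content-Type", "resourceName");

  // Symptom 1: the value starts with whitespace or newlines.
  static boolean hasLeadingWhitespace(String content) {
    return !content.isEmpty() && Character.isWhitespace(content.charAt(0));
  }

  // Symptom 2: metadata lines appear before the first line of real text.
  // Returns the metadata names found, in order of appearance.
  static List<String> leadingMetadataKeys(String content) {
    List<String> found = new ArrayList<>();
    for (String line : content.split("\n")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // skip blank lines between metadata entries
      }
      String key = matchKey(trimmed);
      if (key == null) {
        break; // first line that is not metadata: the body starts here
      }
      found.add(key);
    }
    return found;
  }

  // Returns the known key that the line starts with, or null if none.
  private static String matchKey(String line) {
    for (String key : KNOWN_KEYS) {
      if (line.startsWith(key + " ")) {
        return key;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    String sample = " \n \nstream_size 13580   \nstream_name MARLON BRANDO.rtf"
        + "   \n  \n  1. Vivien Leigh and Marlon Brando";
    System.out.println(hasLeadingWhitespace(sample));
    System.out.println(leadingMetadataKeys(sample));
  }
}
```

A value indexed through a path that behaves as expected should report no leading whitespace and an empty key list.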
>>>>>
>>>>> What do you think would be a solution?
>>>>>
>>>>> Kind Regards,
>>>>> Furkan KAMACI
>>>>>
>>>>>
>>>>
>>>
>>
>
