Re: Unnecessary Newline Characters and Metadata at Content

Karl Wright Fri, 25 Nov 2016 11:11:53 -0800

I am on vacation today and have other responsibilities.  However, I believe
Shinichiro Abe might be able to test this out.  He redid the Solr
integration for SolrJ 6.3.


Thanks,
Karl


On Fri, Nov 25, 2016 at 1:54 PM, Furkan KAMACI <[email protected]>
wrote:

> Hi Karl,
>
> Could you test MCF with Solr? I cannot see the content field with either
> Windows Shares or File System on Solr 4.x, 5.x, or 6.x. Only Solr 4.x
> has content, and it is as I described. The code that sends the content as
> a stream may have a problem.
>
> Kind Regards,
> Furkan KAMACI
>
>
> On Fri, Nov 25, 2016 at 4:13 PM, Furkan KAMACI <[email protected]>
> wrote:
>
>> Hi Karl,
>>
>> By the way, I've tried different versions of Solr and either couldn't get
>> the content or got it as I've described. When I check out the MCF trunk,
>> which uses SolrJ 6.3.0, and use Solr 6.3.0 as the output connector, I can
>> see that documents are indexed, but I cannot even see the "content" field.
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> On Fri, Nov 25, 2016 at 2:01 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Furkan,
>>>
>>> The following code is used to set up a SolrJ object that is then later
>>> converted to a post request:
>>>
>>> >>>>>>
>>>     private void buildExtractUpdateHandlerRequest( long length, InputStream is, String contentType,
>>>       String contentName,
>>>       ContentStreamUpdateRequest contentStreamUpdateRequest )
>>>       throws IOException
>>>     {
>>>       ModifiableSolrParams out = new ModifiableSolrParams();
>>>
>>>       // Write the id field
>>>       writeField(out,LITERAL+idAttributeName,documentURI);
>>>       // Write the rest of the attributes
>>>       if (originalSizeAttributeName != null)
>>>       {
>>>         Long size = document.getOriginalSize();
>>>         if (size != null)
>>>           // Write value
>>>           writeField(out,LITERAL+originalSizeAttributeName,size.toString());
>>>       }
>>>       if (modifiedDateAttributeName != null)
>>>       {
>>>         Date date = document.getModifiedDate();
>>>         if (date != null)
>>>           // Write value
>>>           writeField(out,LITERAL+modifiedDateAttributeName,DateParser.formatISO8601Date(date));
>>>       }
>>>       if (createdDateAttributeName != null)
>>>       {
>>>         Date date = document.getCreatedDate();
>>>         if (date != null)
>>>           // Write value
>>>           writeField(out,LITERAL+createdDateAttributeName,DateParser.formatISO8601Date(date));
>>>       }
>>>       if (indexedDateAttributeName != null)
>>>       {
>>>         Date date = document.getIndexingDate();
>>>         if (date != null)
>>>           // Write value
>>>           writeField(out,LITERAL+indexedDateAttributeName,DateParser.formatISO8601Date(date));
>>>       }
>>>       if (fileNameAttributeName != null)
>>>       {
>>>         String fileName = document.getFileName();
>>>         if (!StringUtils.isBlank(fileName))
>>>           writeField(out,LITERAL+fileNameAttributeName,fileName);
>>>       }
>>>       if (mimeTypeAttributeName != null)
>>>       {
>>>         String mimeType = document.getMimeType();
>>>         if (!StringUtils.isBlank(mimeType))
>>>           writeField(out,LITERAL+mimeTypeAttributeName,mimeType);
>>>       }
>>>
>>>       // Write the access token information
>>>       // Both maps have the same keys.
>>>       Iterator<String> typeIterator = aclsMap.keySet().iterator();
>>>       while (typeIterator.hasNext())
>>>       {
>>>         String aclType = typeIterator.next();
>>>         writeACLs(out,aclType,aclsMap.get(aclType),denyAclsMap.get(aclType));
>>>       }
>>>
>>>       // Write the arguments
>>>       for (String name : arguments.keySet())
>>>       {
>>>         List<String> values = arguments.get(name);
>>>         writeField(out,name,values);
>>>       }
>>>
>>>       // Write the metadata, each in a field by itself
>>>       buildSolrParamsFromMetadata(out);
>>>
>>>       // These are unnecessary now in the case of non-solrcloud setups, because we overrode the SolrJ posting method to use multipart.
>>>       //writeField(out,LITERAL+"stream_size",String.valueOf(length));
>>>       //writeField(out,LITERAL+"stream_name",document.getFileName());
>>>
>>>       // General hint for Tika
>>>       if (!StringUtils.isBlank(document.getFileName()))
>>>         writeField(out,"resource.name",document.getFileName());
>>>
>>>       // Write the commitWithin parameter
>>>       if (commitWithin != null)
>>>         writeField(out,COMMITWITHIN_METADATA,commitWithin);
>>>
>>>       contentStreamUpdateRequest.setParams(out);
>>>
>>>       contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length,contentType,contentName));
>>>     }
>>> <<<<<<
>>>
>>> The ContentStreamUpdateRequest object is defined within SolrJ.  Normally
>>> this would be the end of ManifoldCF involvement, but we have also needed to
>>> override some SolrJ classes because of bugs.  So it is possible that we
>>> could fix this behavior if the problem is within the code we have changed.
>>> However, having said that, I am not sure that the differences you report
>>> are significant in any way. The W3C spec for multipart HTTP requests is
>>> what you'd want to look at for that.
>>>
>>> Please see ModifiedHttpMultipart.java for more details.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, Nov 25, 2016 at 5:24 AM, Furkan KAMACI <[email protected]>
>>> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> I used the default values for Solr. In my Solr output connector, "Use the
>>>> Extract Update Handler" is checked. The update handler is defined as
>>>> "/update/extract". There is no Tika content extractor defined in the job
>>>> pipeline.
>>>>
>>>> I have Wireshark captures and logs from both ManifoldCF and Solr. I can
>>>> share them if you want.
>>>>
>>>> Kind Regards,
>>>> Furkan KAMACI
>>>>
>>>> On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Is this being indexed via the extracting update handler?  What does
>>>>> your pipeline look like?  Is the Tika extractor in the pipeline?
>>>>>
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I've indexed a file into Solr via ManifoldCF. Its content starts
>>>>>> with:
>>>>>>
>>>>>> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire"
>>>>>> directed by Elia Kazan, 1951*
>>>>>>
>>>>>> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>>>> by Elia Kazan, 1951*
>>>>>>
>>>>>> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed
>>>>>> by Elia Kazan, 1951*
>>>>>>
>>>>>> However, when I check Solr, I see this in the content field:
>>>>>>
>>>>>> * " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
>>>>>> application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
>>>>>> \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n  
>>>>>> \n
>>>>>> \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>>>>>> directed by Elia Kazan \n"*
>>>>>>
>>>>>> There are two problems here.
>>>>>>
>>>>>> 1) There are unnecessary newline characters.
>>>>>>
>>>>>> 2) Metadata is prepended to the content field, which should not happen.
>>>>>>
>>>>>> So the problem may be in Solr or in ManifoldCF (related to Tika). When
>>>>>> I index the same document into Solr via cURL, no newline characters or
>>>>>> metadata are prepended.
>>>>>>
>>>>>> What do you think would be a solution?
>>>>>>
>>>>>> Kind Regards,
>>>>>> Furkan KAMACI
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
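
For readers following the quoted method, here is a minimal sketch of the
pattern it relies on: each optional attribute is written as a
"literal."-prefixed request parameter only when both the configured field
name and the value are present. This is not ManifoldCF code. The class
LiteralParamsSketch, the helper buildParams, and the field names and URI used
in main are hypothetical, and the value "literal." for the LITERAL constant
is an assumption based on the parameter naming of Solr's extracting request
handler. A plain map stands in for ModifiableSolrParams so that the example
runs without SolrJ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LiteralParamsSketch {
  // Assumed value of the LITERAL constant used in the quoted method.
  static final String LITERAL = "literal.";

  // Adds one value under a parameter name, keeping insertion order.
  static void writeField(Map<String, List<String>> out, String name, String value) {
    out.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
  }

  // True when the string is null, empty, or whitespace only.
  static boolean isBlank(String s) {
    return s == null || s.trim().isEmpty();
  }

  // Hypothetical helper: builds the request parameters for one document.
  // A null field name means "this attribute is not mapped to a Solr field".
  static Map<String, List<String>> buildParams(
      String idField, String documentURI,
      String fileNameField, String fileName,
      String mimeTypeField, String mimeType) {
    Map<String, List<String>> out = new LinkedHashMap<>();

    // The id field is always written.
    writeField(out, LITERAL + idField, documentURI);

    // Optional attributes are written only when mapped and non-blank.
    if (fileNameField != null && !isBlank(fileName))
      writeField(out, LITERAL + fileNameField, fileName);
    if (mimeTypeField != null && !isBlank(mimeType))
      writeField(out, LITERAL + mimeTypeField, mimeType);

    // General hint for Tika, as in the quoted method.
    if (!isBlank(fileName))
      writeField(out, "resource.name", fileName);

    return out;
  }

  public static void main(String[] args) {
    // Illustrative values only.
    Map<String, List<String>> params = buildParams(
        "id", "file:/docs/MARLON%20BRANDO.rtf",
        "file_name", "MARLON BRANDO.rtf",
        "mime_type", "application/rtf");
    params.forEach((name, values) -> System.out.println(name + "=" + values));
  }
}
```

Comparing such a parameter list with the parameters of a cURL request that
indexes correctly may help narrow down where the two requests differ.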
