Hi Cihad,
The comparison should have been:

mp.getCount() <= attachmentNumber
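
A minimal sketch of why the direction of the comparison matters, using the
values reported further down in this thread (mp.getCount() is 2,
attachmentNumber is 0 or 1). The class and method names are hypothetical and
are not part of EmailConnector; only the comparison itself comes from this
message:

```java
class AttachmentBoundsSketch {
    // Returns true when attachmentNumber does not refer to an existing part,
    // i.e. when the document should be deleted and skipped.
    // Parts are numbered from 0, so valid numbers are 0 .. partCount - 1.
    static boolean isOutOfRange(int partCount, int attachmentNumber) {
        return partCount <= attachmentNumber;
    }

    public static void main(String[] args) {
        int partCount = 2; // stands in for mp.getCount()
        for (int attachmentNumber = 0; attachmentNumber <= 2; attachmentNumber++) {
            String action = isOutOfRange(partCount, attachmentNumber) ? "delete" : "process";
            System.out.println(attachmentNumber + " -> " + action);
        }
    }
}
```

With the original ">=" test, attachment numbers 0 and 1 both satisfy
2 >= attachmentNumber, so every valid attachment was deleted.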

As for changing "/" to ":", the real problem is that these delimiters
should all be ":", including the one on line 678.  My apologies.  I've
committed the changes.

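To illustrate the delimiter issue, here is a rough sketch of splitting a
document identifier on ":". It assumes a hypothetical layout of
folder:messageId:attachmentNumber, which is only inferred from the stack
trace quoted in this thread. The class, the method, and the sample message ID
are made up for the example and are not the connector's actual code:

```java
class AttachmentIdSketch {
    // Hypothetical identifier layout: "<folder>:<messageId>[:<attachmentNumber>]".
    // Returns the attachment number, or null when the identifier has no
    // third field (that is, it names the email itself).
    static Integer parseAttachmentIndex(String di) {
        int index1 = di.indexOf(":");
        if (index1 == -1) {
            return null;
        }
        int index2 = di.indexOf(":", index1 + 1);
        if (index2 == -1) {
            return null;
        }
        // Throws NumberFormatException if the third field is not an integer.
        return Integer.valueOf(di.substring(index2 + 1));
    }

    public static void main(String[] args) {
        // The folder name contains "/", so "/" cannot double as the field separator.
        System.out.println(parseAttachmentIndex("myFolder/test:<msg1@example.com>"));
        System.out.println(parseAttachmentIndex("myFolder/test:<msg1@example.com>:1"));
    }
}
```

Because folder names such as "myFolder/test" contain "/", searching for "/"
finds a position inside the folder name rather than a field boundary.
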
Thanks,
Karl


On Thu, Feb 9, 2017 at 8:15 AM, Cihad Guzel <[email protected]> wrote:

> Hi Karl,
>
> In my case, mp.getCount() is 2 and attachmentNumber is 0 or 1.
>
> Regards,
> Cihad Guzel
>
> 2017-02-09 16:07 GMT+03:00 Cihad Guzel <[email protected]>:
>
>> Hi Karl,
>>
>> I made some changes to the code, and after that indexing completed
>> successfully.
>>
>> The changes are as follows:
>>
>> I removed these lines (lines 772-775):
>>
>>               if (mp.getCount() >= attachmentNumber) {
>>                 activities.deleteDocument(documentIdentifier);
>>                 continue;
>>               }
>>
>> I updated these lines (lines 1485 and 1586):
>>       int index2 = di.indexOf("/", index1 + 1);
>> to:
>>       int index2 = di.indexOf(":", index1 + 1);
>>
>> Regards,
>> Cihad Guzel
>>
>>
>>
>>
>> 2017-02-08 2:10 GMT+03:00 Karl Wright <[email protected]>:
>>
>>> Hi Cihad,
>>>
>>> You need to set an attachment URL template for the attachments to be
>>> crawled.  Open your email connection and click the "URL" tab, and you will
>>> see the new field there.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Feb 7, 2017 at 6:07 PM, Cihad Guzel <[email protected]> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> Shouldn't the 'else' part be executed when the email has an
>>>> attachment? Although the email has an attachment, only the first
>>>> part was processed. Also, I don't see the attachment's content in
>>>> the Solr index.
>>>>
>>>> I edited the code for testing as follows:
>>>>
>>>>  if (attachmentIndex == null) {
>>>>           // It's an email
>>>>           System.out.println("running if block");
>>>> ...
>>>>         } else {
>>>>           System.out.println("running else block");
>>>>           // It's an attachment
>>>>           attachmentNumber = attachmentIndex;
>>>> ...
>>>>         }
>>>>
>>>> Then I ran my job. Processing ran 3 times. The log looks like this:
>>>>
>>>> ...
>>>> running if block
>>>> running if block
>>>> running if block
>>>> ...
>>>>
>>>>
>>>> The Solr response:
>>>>
>>>> {
>>>>         "subject":["pdf test page"],
>>>>         "from":["Cihad Guzel <[email protected]>"],
>>>>         "id":"http://sampleserver/%C4%B0%C5%9F%2FmyFolder%2Ftest?id=
>>>> %3CCADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw%40mai
>>>> l.gmail.com%3E",
>>>>         "date":["Tue Feb 07 20:37:35 MSK 2017"],
>>>>         "mimetype":["",
>>>>           ""],
>>>>         "created_date":"2017-02-07T17:37:35.000Z",
>>>>         "indexed_date":"2017-02-07T21:18:05.382Z",
>>>>         "to":["Cihad Guzel <[email protected]>"],
>>>>         "modified_date":"2017-02-07T17:37:35.000Z",
>>>>         "encoding":["",
>>>>           ""],
>>>>         "mime_type":"text/plain",
>>>>         "stream_size":["null"],
>>>>         "x_parsed_by":["org.apache.tika.parser.DefaultParser",
>>>>           "org.apache.tika.parser.txt.TXTParser"],
>>>>         "stream_content_type":["text/plain"],
>>>>         "content_encoding":["windows-1252"],
>>>>         "content_type":["text/plain; charset=windows-1252"],
>>>>         "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type: text/html;
>>>> charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9... ",
>>>>         "language":"en",
>>>>         "_version_":1558710621053124608}]
>>>>   }
>>>>
>>>>
>>>>
>>>> 2017-02-08 1:17 GMT+03:00 Karl Wright <[email protected]>:
>>>>
>>>>> Here's the full code for this class:
>>>>>
>>>>> https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/email/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/email/EmailConnector.java
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Cihad,
>>>>>>
>>>>>> The variable attachmentIndex is *supposed* to be null except when an
>>>>>> attachment is being processed.  The code should look like this:
>>>>>>
>>>>>>         if (attachmentIndex == null) {
>>>>>>           // It's an email
>>>>>> ...
>>>>>>         } else {
>>>>>>           // It's an attachment
>>>>>>           attachmentNumber = attachmentIndex;
>>>>>> ...
>>>>>>         }
>>>>>>
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> I added a log line for testing. It looks like attachmentIndex is null.
>>>>>>>
>>>>>>> 2017-02-08 0:11 GMT+03:00 Karl Wright <[email protected]>:
>>>>>>>
>>>>>>>> I attached a second patch (to apply on top of the first patch).
>>>>>>>> Please let me know if that fixes the issue.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> I get the following error:
>>>>>>>>>
>>>>>>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed:
>>>>>>>>> For input string: "myFolder/test:<CADNgPDgSXHeWo
>>>>>>>>> [email protected]>"
>>>>>>>>> java.lang.NumberFormatException: For input string:
>>>>>>>>> "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi3
>>>>>>>>> [email protected]>"
>>>>>>>>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>>>>>         at java.lang.Integer.parseInt(Integer.java:580)
>>>>>>>>>         at java.lang.Integer.parseInt(Integer.java:615)
>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Thanks Karl,
>>>>>>>>>>
>>>>>>>>>> I will try it.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Cihad Guzel
>>>>>>>>>>
>>>>>>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <[email protected]>:
>>>>>>>>>>
>>>>>>>>>>> I've created a ticket, CONNECTORS-1375, and attached a patch to
>>>>>>>>>>> it.  Please let me know if it works for you; if not, I'll fix
>>>>>>>>>>> what doesn't work.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Correction: the only metadata attribute we set is the
>>>>>>>>>>>> attachment(s) mimetype (as a multivalued field) -- this doesn't
>>>>>>>>>>>> currently include the attachment data.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Cihad,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The email connector is providing the attachment data
>>>>>>>>>>>>> unextracted to the output connector as metadata attribute
>>>>>>>>>>>>> data.  There are no transformation connectors that look at
>>>>>>>>>>>>> this metadata.  Solr cell also probably does not handle
>>>>>>>>>>>>> binary in random metadata attributes the proper way.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The connector's attachment code therefore seems to be
>>>>>>>>>>>>> designed only to deal with textual attachments.  The right
>>>>>>>>>>>>> solution is to have individual IDs for each attachment.  But
>>>>>>>>>>>>> that would also require there to be a URL we could construct
>>>>>>>>>>>>> for each attachment.  We could provide an additional URI
>>>>>>>>>>>>> template for attachments, but I'd wonder if your system has
>>>>>>>>>>>>> the ability to serve attachments by their own URLs.  Please
>>>>>>>>>>>>> let me know if this would work and if so I can create a
>>>>>>>>>>>>> ticket and work on making these changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am trying the email connector with Gmail. I attached the
>>>>>>>>>>>>>> file [1] to a new email and sent it to my test email address.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The body of my email is: "this is test mail for mfc"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then I ran my email job, and the email was indexed into Solr
>>>>>>>>>>>>>> successfully. However, Solr's content field does not contain
>>>>>>>>>>>>>> the extracted text of my attachment. The Solr content field
>>>>>>>>>>>>>> looks like this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>>>>>>>>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>>>>>>>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test 
>>>>>>>>>>>>>> mail for
>>>>>>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>>> application/pdf; name=\"pdf-test.pdf\"\r\nContent-Disposition:
>>>>>>>>>>>>>> attachment; 
>>>>>>>>>>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>>>>>>>> ..."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Does the MFC email connector know that the attachment's file
>>>>>>>>>>>>>> type is PDF? Doesn't it extract the contents?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> Cihad Güzel
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks
>>>>>>>>>> Cihad Güzel
