Hi Cihad,

The comparison should have been:

mp.getCount() <= attachmentNumber
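
For illustration only, here is a minimal standalone sketch of that bounds check. It assumes attachmentNumber is a zero-based index into the message parts, which matches the values 0 and 1 reported below for a count of 2. The class and method names are hypothetical, and partCount stands in for mp.getCount(); this is not the connector's actual code.

```java
public class AttachmentBoundsCheck {
  /**
   * Returns true when attachmentNumber does not refer to an existing part.
   * partCount stands in for mp.getCount(); with zero-based numbering
   * (an assumption), the valid indexes are 0 .. partCount - 1.
   */
  static boolean isOutOfRange(int partCount, int attachmentNumber) {
    return partCount <= attachmentNumber;
  }

  public static void main(String[] args) {
    int partCount = 2; // the mp.getCount() value reported in this thread
    for (int attachmentNumber = 0; attachmentNumber <= 2; attachmentNumber++) {
      System.out.println("attachment " + attachmentNumber
          + " out of range: " + isOutOfRange(partCount, attachmentNumber));
    }
  }
}
```

With the earlier ">=" comparison, a count of 2 satisfies the condition for both index 0 and index 1, so every valid attachment would be deleted rather than indexed.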

As for changing "/" to ":", the real problem is that these should all be ":"'s, including line 678. My apologies. I've committed the changes.

Thanks,
Karl

On Thu, Feb 9, 2017 at 8:15 AM, Cihad Guzel <[email protected]> wrote:

> Hi Karl,
>
> mp.getCount() is 2
> and
> attachmentNumber is '0' or '1' in my case.
>
> Regards,
> Cihad Guzel
>
> 2017-02-09 16:07 GMT+03:00 Cihad Guzel <[email protected]>:
>
>> Hi Karl,
>>
>> I made some changes in the code and then the indexing was done
>> successfully.
>>
>> The changes are as follows:
>>
>> I have removed these lines (lines: 772-775):
>>
>> if (mp.getCount() >= attachmentNumber) {
>>   activities.deleteDocument(documentIdentifier);
>>   continue;
>> }
>>
>> I updated these lines (lines: 1485 and 1586):
>> int index2 = di.indexOf("/", index1 + 1);
>> to:
>> int index2 = di.indexOf(":", index1 + 1);
>>
>> Regards,
>> Cihad Guzel
>>
>> 2017-02-08 2:10 GMT+03:00 Karl Wright <[email protected]>:
>>
>>> Hi Cihad,
>>>
>>> You need to set an attachment URL template for the attachments to be
>>> crawled. Open your email connection and click the "URL" tab, and you will
>>> see the new field there.
>>>
>>> Karl
>>>
>>> On Tue, Feb 7, 2017 at 6:07 PM, Cihad Guzel <[email protected]> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> Doesn't the 'else' part have to be processed when the email has an
>>>> attachment?
>>>> Although the email has an attachment, only the first part was
>>>> processed. Also, I don't see the attachment's content in the Solr index.
>>>>
>>>> I edited the code for testing as follows:
>>>>
>>>> if (attachmentIndex == null) {
>>>>   // It's an email
>>>>   System.out.println("running if block");
>>>>   ...
>>>> } else {
>>>>   System.out.println("running else block");
>>>>   // It's an attachment
>>>>   attachmentNumber = attachmentIndex;
>>>>   ...
>>>> }
>>>>
>>>> Then I ran my job. It processed 3 times. The log looks like:
>>>>
>>>> ...
>>>> running if block
>>>> running if block
>>>> running if block
>>>> ...
>>>>
>>>> The solr response:
>>>>
>>>> {
>>>> "subject":["pdf test page"],
>>>> "from":["Cihad Guzel <[email protected]>"],
>>>> "id":"http://sampleserver/%C4%B0%C5%9F%2FmyFolder%2Ftest?id=
>>>> %3CCADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw%40mai
>>>> l.gmail.com%3E",
>>>> "date":["Tue Feb 07 20:37:35 MSK 2017"],
>>>> "mimetype":["",
>>>> ""],
>>>> "created_date":"2017-02-07T17:37:35.000Z",
>>>> "indexed_date":"2017-02-07T21:18:05.382Z",
>>>> "to":["Cihad Guzel <[email protected]>"],
>>>> "modified_date":"2017-02-07T17:37:35.000Z",
>>>> "encoding":["",
>>>> ""],
>>>> "mime_type":"text/plain",
>>>> "stream_size":["null"],
>>>> "x_parsed_by":["org.apache.tika.parser.DefaultParser",
>>>> "org.apache.tika.parser.txt.TXTParser"],
>>>> "stream_content_type":["text/plain"],
>>>> "content_encoding":["windows-1252"],
>>>> "content_type":["text/plain; charset=windows-1252"],
>>>> "content":" \n \n \n \n \n \n \n \n \n \n
>>>> --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type: text/html;
>>>> charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9...
",
>>>> "language":"en",
>>>> "_version_":1558710621053124608}]
>>>> }
>>>>
>>>> 2017-02-08 1:17 GMT+03:00 Karl Wright <[email protected]>:
>>>>
>>>>> Here's the full code for this class:
>>>>>
>>>>> https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/email/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/email/EmailConnector.java
>>>>>
>>>>> Karl
>>>>>
>>>>> On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Cihad,
>>>>>>
>>>>>> The variable attachmentIndex is *supposed* to be null except when an
>>>>>> attachment is being processed. The code should look like this:
>>>>>>
>>>>>> if (attachmentIndex == null) {
>>>>>>   // It's an email
>>>>>>   ...
>>>>>> } else {
>>>>>>   // It's an attachment
>>>>>>   attachmentNumber = attachmentIndex;
>>>>>>   ...
>>>>>> }
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> I added a LOG line for testing. It looks like attachmentIndex is null.
>>>>>>>
>>>>>>> 2017-02-08 0:11 GMT+03:00 Karl Wright <[email protected]>:
>>>>>>>
>>>>>>>> I attached a second patch (to apply on top of the first patch).
>>>>>>>> Please let me know if that fixes the issue.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> I have an error as follows:
>>>>>>>>>
>>>>>>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed:
>>>>>>>>> For input string: "myFolder/test:<CADNgPDgSXHeWo
>>>>>>>>> [email protected]>"
>>>>>>>>> java.lang.NumberFormatException: For input string:
>>>>>>>>> "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi3
>>>>>>>>> [email protected]>"
>>>>>>>>> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>>>>> at java.lang.Integer.parseInt(Integer.java:580)
>>>>>>>>> at java.lang.Integer.parseInt(Integer.java:615)
>>>>>>>>> at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>>>>>> at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>>>>
>>>>>>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Thanks Karl,
>>>>>>>>>>
>>>>>>>>>> I will try it.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Cihad Guzel
>>>>>>>>>>
>>>>>>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <[email protected]>:
>>>>>>>>>>
>>>>>>>>>>> I've created a ticket and attached a patch to it.
>>>>>>>>>>> CONNECTORS-1375. Please let me know if it works for you; if not,
>>>>>>>>>>> I'll fix what doesn't work.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Correction: the only metadata attribute we set is the
>>>>>>>>>>>> attachment(s) mimetype (as a multivalued field) -- this doesn't
>>>>>>>>>>>> currently include the attachment data.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Cihad,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The email connector is providing the attachment data
>>>>>>>>>>>>> unextracted to the output connector as metadata attribute data.
>>>>>>>>>>>>> There are no transformation connectors that look at this
>>>>>>>>>>>>> metadata. Solr cell also probably does not handle binary in
>>>>>>>>>>>>> random metadata attributes the proper way.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The connector's attachment code therefore seems to be designed
>>>>>>>>>>>>> only to deal with textual attachments. The right solution is to
>>>>>>>>>>>>> have individual IDs for each attachment. But that would also
>>>>>>>>>>>>> require there to be a URL we could construct for each
>>>>>>>>>>>>> attachment. We could provide an additional URI template for
>>>>>>>>>>>>> attachments, but I'd wonder if your system has the ability to
>>>>>>>>>>>>> serve attachments by their own URLs. Please let me know if this
>>>>>>>>>>>>> would work and if so I can create a ticket and work on making
>>>>>>>>>>>>> these changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am trying the email connector with Gmail. I attached the
>>>>>>>>>>>>>> file [1] to a new email and sent it to my test email address.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My mail content body is: "this is test mail for mfc"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then I ran my email job and the email was indexed to Solr
>>>>>>>>>>>>>> successfully. But Solr's content field does not have my
>>>>>>>>>>>>>> attachment's content body.
>>>>>>>>>>>>>> The Solr content field looks like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "content":" \n \n \n \n \n \n \n \n \n \n
>>>>>>>>>>>>>> --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>>>>>>>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test
>>>>>>>>>>>>>> mail for
>>>>>>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>>> application/pdf; name=\"pdf-test.pdf\"\r\nContent-Disposition:
>>>>>>>>>>>>>> attachment;
>>>>>>>>>>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>>>>>>>> ..."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Does the MFC email connector know that the attachment's file
>>>>>>>>>>>>>> type is pdf? Does it not extract the contents?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> Cihad Güzel
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks
>>>>>>>>>> Cihad Güzel
