Thanks Karl.

Regards,
Cihad Guzel

2017-02-09 16:27 GMT+03:00 Karl Wright <[email protected]>:

> Hi Cihad,
>
> The comparison should have been:
>
>   mp.getCount() <= attachmentNumber
>
> As for changing "/" to ":", the real problem is that these should all be
> ":"'s, including line 678. My apologies. I've committed the changes.
>
> Thanks,
> Karl
>
> On Thu, Feb 9, 2017 at 8:15 AM, Cihad Guzel <[email protected]> wrote:
>
>> Hi Karl,
>>
>> mp.getCount() is 2, and attachmentNumber is '0' or '1' in my case.
>>
>> Regards,
>> Cihad Guzel
>>
>> 2017-02-09 16:07 GMT+03:00 Cihad Guzel <[email protected]>:
>>
>>> Hi Karl,
>>>
>>> I made some changes in the code, and then the indexing completed
>>> successfully.
>>>
>>> The changes are as follows:
>>>
>>> I removed these lines (lines 772-775):
>>>
>>>   if (mp.getCount() >= attachmentNumber) {
>>>     activities.deleteDocument(documentIdentifier);
>>>     continue;
>>>   }
>>>
>>> I changed these lines (lines 1485 and 1586):
>>>   int index2 = di.indexOf("/", index1 + 1);
>>> to:
>>>   int index2 = di.indexOf(":", index1 + 1);
>>>
>>> Regards,
>>> Cihad Guzel
>>>
>>> 2017-02-08 2:10 GMT+03:00 Karl Wright <[email protected]>:
>>>
>>>> Hi Cihad,
>>>>
>>>> You need to set an attachment URL template for the attachments to be
>>>> crawled. Open your email connection and click the "URL" tab, and you will
>>>> see the new field there.
>>>>
>>>> Karl
>>>>
>>>> On Tue, Feb 7, 2017 at 6:07 PM, Cihad Guzel <[email protected]> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> Doesn't the 'else' part have to be processed when the email has an
>>>>> attachment? Although the email has an attachment, only the first part
>>>>> was processed. Also, I don't see the attachment's content in the Solr
>>>>> index.
>>>>>
>>>>> I edited the code for testing as follows:
>>>>>
>>>>> if (attachmentIndex == null) {
>>>>>   // It's an email
>>>>>   System.out.println("running if block");
>>>>>   ...
>>>>> } else {
>>>>>   System.out.println("running else block");
>>>>>   // It's an attachment
>>>>>   attachmentNumber = attachmentIndex;
>>>>>   ...
>>>>> }
>>>>>
>>>>> Then I ran my job. It processed 3 times. The log looks like this:
>>>>>
>>>>> ...
>>>>> running if block
>>>>> running if block
>>>>> running if block
>>>>> ...
>>>>>
>>>>> The Solr response:
>>>>>
>>>>> {
>>>>> "subject":["pdf test page"],
>>>>> "from":["Cihad Guzel <[email protected]>"],
>>>>> "id":"http://sampleserver/%C4%B0%C5%9F%2FmyFolder%2Ftest?id=%3CCADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw%40mail.gmail.com%3E",
>>>>> "date":["Tue Feb 07 20:37:35 MSK 2017"],
>>>>> "mimetype":["", ""],
>>>>> "created_date":"2017-02-07T17:37:35.000Z",
>>>>> "indexed_date":"2017-02-07T21:18:05.382Z",
>>>>> "to":["Cihad Guzel <[email protected]>"],
>>>>> "modified_date":"2017-02-07T17:37:35.000Z",
>>>>> "encoding":["", ""],
>>>>> "mime_type":"text/plain",
>>>>> "stream_size":["null"],
>>>>> "x_parsed_by":["org.apache.tika.parser.DefaultParser",
>>>>> "org.apache.tika.parser.txt.TXTParser"],
>>>>> "stream_content_type":["text/plain"],
>>>>> "content_encoding":["windows-1252"],
>>>>> "content_type":["text/plain; charset=windows-1252"],
>>>>> "content":" \n \n \n \n \n \n \n \n \n \n
>>>>> --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>> base64\r\nX-Attachment-Id:
>>>>> f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9... ",
>>>>> "language":"en",
>>>>> "_version_":1558710621053124608}]
>>>>> }
>>>>>
>>>>> 2017-02-08 1:17 GMT+03:00 Karl Wright <[email protected]>:
>>>>>
>>>>>> Here's the full code for this class:
>>>>>>
>>>>>> https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/email/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/email/EmailConnector.java
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Cihad,
>>>>>>>
>>>>>>> The variable attachmentIndex is *supposed* to be null except when an
>>>>>>> attachment is being processed. The code should look like this:
>>>>>>>
>>>>>>> if (attachmentIndex == null) {
>>>>>>>   // It's an email
>>>>>>>   ...
>>>>>>> } else {
>>>>>>>   // It's an attachment
>>>>>>>   attachmentNumber = attachmentIndex;
>>>>>>>   ...
>>>>>>> }
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> I added LOG line for testing. It looks attachmentIndex is null.
>>>>>>>>
>>>>>>>> 2017-02-08 0:11 GMT+03:00 Karl Wright <[email protected]>:
>>>>>>>>
>>>>>>>>> I attached a second patch (to apply on top of the first patch).
>>>>>>>>> Please let me know if that fixes the issue.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl,
>>>>>>>>>>
>>>>>>>>>> I get the following error:
>>>>>>>>>>
>>>>>>>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error
>>>>>>>>>> tossed: For input string: "myFolder/test:<CADNgPDgSXHeWo
>>>>>>>>>> [email protected]>"
>>>>>>>>>> java.lang.NumberFormatException: For input string:
>>>>>>>>>> "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi3
>>>>>>>>>> [email protected]>"
>>>>>>>>>>   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>>>>>>   at java.lang.Integer.parseInt(Integer.java:580)
>>>>>>>>>>   at java.lang.Integer.parseInt(Integer.java:615)
>>>>>>>>>>   at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>>>>>>>   at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>>>>>
>>>>>>>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <[email protected]>:
>>>>>>>>>>
>>>>>>>>>>> Thanks Karl,
>>>>>>>>>>>
>>>>>>>>>>> I will try it.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Cihad Guzel
>>>>>>>>>>>
>>>>>>>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <[email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>> I've created a ticket and attached a patch to it:
>>>>>>>>>>>> CONNECTORS-1375. Please let me know if it works for you; if not,
>>>>>>>>>>>> I'll fix what doesn't work.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Correction: the only metadata attribute we set is the
>>>>>>>>>>>>> attachment(s) mimetype (as a multivalued field) -- this doesn't
>>>>>>>>>>>>> currently include the attachment data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Cihad,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The email connector is providing the attachment data
>>>>>>>>>>>>>> unextracted to the output connector as metadata attribute data.
>>>>>>>>>>>>>> There are no transformation connectors that look at this metadata.
>>>>>>>>>>>>>> Solr Cell also probably does not properly handle binary data in
>>>>>>>>>>>>>> arbitrary metadata attributes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The connector's attachment code therefore seems to be
>>>>>>>>>>>>>> designed only to deal with textual attachments. The right
>>>>>>>>>>>>>> solution is to have individual IDs for each attachment. But that
>>>>>>>>>>>>>> would also require there to be a URL we could construct for each
>>>>>>>>>>>>>> attachment. We could provide an additional URI template for
>>>>>>>>>>>>>> attachments, but I'd wonder if your system has the ability to
>>>>>>>>>>>>>> serve attachments by their own URLs. Please let me know if this
>>>>>>>>>>>>>> would work and if so I can create a ticket and work on making
>>>>>>>>>>>>>> these changes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am trying the email connector with Gmail. I attached the
>>>>>>>>>>>>>>> file [1] to a new email and sent it to my test email address.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My mail content body is like: "this is test mail for mfc"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then I ran my email job and the email was indexed to Solr
>>>>>>>>>>>>>>> successfully.
>>>>>>>>>>>>>>> But the Solr content field does not contain my attachment's
>>>>>>>>>>>>>>> content body. The Solr content field looks like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "content":" \n \n \n \n \n \n \n \n \n \n
>>>>>>>>>>>>>>> --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>>>>>>>>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test
>>>>>>>>>>>>>>> mail for
>>>>>>>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>>>> application/pdf; name=\"pdf-test.pdf\"\r\nContent-Disposition:
>>>>>>>>>>>>>>> attachment;
>>>>>>>>>>>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>>>>>>>>> ..."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does the MFC email connector know that the attachment's file
>>>>>>>>>>>>>>> type is PDF? Does it not extract the contents?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>> Cihad Güzel
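
For readers following the two fixes discussed in this thread, here is a minimal Java sketch. It is not the EmailConnector code: the class AttachmentIdSketch, its methods, and the identifier layout "folder:messageId[:attachmentNumber]" are assumptions made for illustration, based only on the ":" separator and the comparison Karl describes. It also assumes the message ID itself contains no ":".

```java
/** Hypothetical helper for illustration; not the actual EmailConnector code. */
final class AttachmentIdSketch {

  /** Pieces of an identifier; attachmentIndex is null for the email itself. */
  static final class Parsed {
    final String folder;
    final String messageId;
    final Integer attachmentIndex;

    Parsed(String folder, String messageId, Integer attachmentIndex) {
      this.folder = folder;
      this.messageId = messageId;
      this.attachmentIndex = attachmentIndex;
    }
  }

  /** Splits on ":" only, so a folder such as "myFolder/test" stays intact. */
  static Parsed parse(String di) {
    int index1 = di.indexOf(":");
    if (index1 == -1) {
      throw new IllegalArgumentException("No ':' separator in: " + di);
    }
    String folder = di.substring(0, index1);
    int index2 = di.indexOf(":", index1 + 1);
    if (index2 == -1) {
      // Only two fields: this identifier names the email itself.
      return new Parsed(folder, di.substring(index1 + 1), null);
    }
    // Three fields: the last one is the zero-based attachment number.
    int attachmentNumber = Integer.parseInt(di.substring(index2 + 1));
    return new Parsed(folder, di.substring(index1 + 1, index2), attachmentNumber);
  }

  /** True when the message has no part at this zero-based attachment number. */
  static boolean attachmentMissing(int partCount, int attachmentNumber) {
    return partCount <= attachmentNumber;
  }
}
```

With the values reported above (mp.getCount() is 2, attachmentNumber is 0 or 1), attachmentMissing returns false for both attachments, whereas the earlier ">=" comparison is true for both and would delete them. Searching for "/" instead of ":" can cut inside a folder path such as "myFolder/test", leaving a non-numeric remainder for Integer.parseInt.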
