Hi Karl, mp.getCount() is 2 and attachmentNumber is '0' or '1' in my case.
Regards,
Cihad Guzel

2017-02-09 16:07 GMT+03:00 Cihad Guzel <[email protected]>:

> Hi Karl,
>
> I made some changes in the code, and then the indexing completed
> successfully.
>
> The changes are as follows:
>
> I removed these lines (lines 772-775):
>
>   if (mp.getCount() >= attachmentNumber) {
>     activities.deleteDocument(documentIdentifier);
>     continue;
>   }
>
> I updated these lines (lines 1485 and 1586):
>   int index2 = di.indexOf("/", index1 + 1);
> to:
>   int index2 = di.indexOf(":", index1 + 1);
>
> Regards,
> Cihad Guzel
>
> 2017-02-08 2:10 GMT+03:00 Karl Wright <[email protected]>:
>
>> Hi Cihad,
>>
>> You need to set an attachment URL template for the attachments to be
>> crawled. Open your email connection and click the "URL" tab, and you will
>> see the new field there.
>>
>> Karl
>>
>> On Tue, Feb 7, 2017 at 6:07 PM, Cihad Guzel <[email protected]> wrote:
>>
>>> Hi Karl,
>>>
>>> Doesn't the 'else' part have to be processed when the email has an
>>> attachment?
>>> Although the email has an attachment, only the first part was processed.
>>> Also, I don't see the attachment's content in the Solr index.
>>>
>>> I edited the code for testing as follows:
>>>
>>> if (attachmentIndex == null) {
>>>   // It's an email
>>>   System.out.println("running if block");
>>>   ...
>>> } else {
>>>   System.out.println("running else block");
>>>   // It's an attachment
>>>   attachmentNumber = attachmentIndex;
>>>   ...
>>> }
>>>
>>> Then I ran my job. It processed 3 times. The log looks like this:
>>>
>>> ...
>>> running if block
>>> running if block
>>> running if block
>>> ...
>>>
>>> The Solr response:
>>>
>>> {
>>> "subject":["pdf test page"],
>>> "from":["Cihad Guzel <[email protected]>"],
>>> "id":"http://sampleserver/%C4%B0%C5%9F%2FmyFolder%2Ftest?id=%3CCADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw%40mail.gmail.com%3E",
>>> "date":["Tue Feb 07 20:37:35 MSK 2017"],
>>> "mimetype":["",
>>> ""],
>>> "created_date":"2017-02-07T17:37:35.000Z",
>>> "indexed_date":"2017-02-07T21:18:05.382Z",
>>> "to":["Cihad Guzel <[email protected]>"],
>>> "modified_date":"2017-02-07T17:37:35.000Z",
>>> "encoding":["",
>>> ""],
>>> "mime_type":"text/plain",
>>> "stream_size":["null"],
>>> "x_parsed_by":["org.apache.tika.parser.DefaultParser",
>>> "org.apache.tika.parser.txt.TXTParser"],
>>> "stream_content_type":["text/plain"],
>>> "content_encoding":["windows-1252"],
>>> "content_type":["text/plain; charset=windows-1252"],
>>> "content":" \n \n \n \n \n \n \n \n \n \n
>>> --94eb2c1910841bc55f0547f43443\r\nContent-Type: multipart/alternative;
>>> boundary=94eb2c1910841bc5530547f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>> application/pdf; name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding: base64\r\nX-Attachment-Id:
>>> f_iyvt78qa0\r\n\r\nJVBERi0xLjYNJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9...
>>> ",
>>> "language":"en",
>>> "_version_":1558710621053124608}]
>>> }
>>>
>>> 2017-02-08 1:17 GMT+03:00 Karl Wright <[email protected]>:
>>>
>>>> Here's the full code for this class:
>>>>
>>>> https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/email/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/email/EmailConnector.java
>>>>
>>>> Karl
>>>>
>>>> On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Cihad,
>>>>>
>>>>> The variable attachmentIndex is *supposed* to be null except when an
>>>>> attachment is being processed. The code should look like this:
>>>>>
>>>>> if (attachmentIndex == null) {
>>>>>   // It's an email
>>>>>   ...
>>>>> } else {
>>>>>   // It's an attachment
>>>>>   attachmentNumber = attachmentIndex;
>>>>>   ...
>>>>> }
>>>>>
>>>>> Karl
>>>>>
>>>>> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <[email protected]> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> I added LOG line for testing. It looks attachmentIndex is null.
>>>>>>
>>>>>> 2017-02-08 0:11 GMT+03:00 Karl Wright <[email protected]>:
>>>>>>
>>>>>>> I attached a second patch (to apply on top of the first patch).
>>>>>>> Please let me know if that fixes the issue.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> I have an error as follow:
>>>>>>>>
>>>>>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed:
>>>>>>>> For input string: "myFolder/test:<CADNgPDgSXHeWo [email protected]>"
>>>>>>>> java.lang.NumberFormatException: For input string: "myFolder/test:<cadngpdgsxhewo0gdnul6s2sogusxua9mx2wxot23wi37hog...@mail.gmail.com>"
>>>>>>>>   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>>>>   at java.lang.Integer.parseInt(Integer.java:580)
>>>>>>>>   at java.lang.Integer.parseInt(Integer.java:615)
>>>>>>>>   at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>>>>>   at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>>>
>>>>>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <[email protected]>:
>>>>>>>>
>>>>>>>>> Thanks Karl,
>>>>>>>>>
>>>>>>>>> I will try it.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Cihad Guzel
>>>>>>>>>
>>>>>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <[email protected]>:
>>>>>>>>>
>>>>>>>>>> I've created a ticket and attached a patch to it.
>>>>>>>>>> CONNECTORS-1375. Please let me know if it works for you; if not,
>>>>>>>>>> I'll fix what doesn't work.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Correction: the only metadata attribute we set is the
>>>>>>>>>>> attachment(s) mimetype (as a multivalued field) -- this doesn't
>>>>>>>>>>> currently include the attachment data.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Cihad,
>>>>>>>>>>>>
>>>>>>>>>>>> The email connector is providing the attachment data
>>>>>>>>>>>> unextracted to the output connector as metadata attribute data.
>>>>>>>>>>>> There are no transformation connectors that look at this
>>>>>>>>>>>> metadata. Solr cell also probably does not handle binary in
>>>>>>>>>>>> random metadata attributes the proper way.
>>>>>>>>>>>>
>>>>>>>>>>>> The connector's attachment code therefore seems to be designed
>>>>>>>>>>>> only to deal with textual attachments. The right solution is to
>>>>>>>>>>>> have individual IDs for each attachment. But that would also
>>>>>>>>>>>> require there to be a URL we could construct for each
>>>>>>>>>>>> attachment. We could provide an additional URI template for
>>>>>>>>>>>> attachments, but I'd wonder if your system has the ability to
>>>>>>>>>>>> serve attachments by their own URLs. Please let me know if this
>>>>>>>>>>>> would work and if so I can create a ticket and work on making
>>>>>>>>>>>> these changes.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am trying the email connector with Gmail. I attached the
>>>>>>>>>>>>> file [1] to a new email and sent it to my test email address.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My mail body is: "this is test mail for mfc"
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I ran my email job and the email was indexed into Solr
>>>>>>>>>>>>> successfully. But the Solr content field does not contain my
>>>>>>>>>>>>> attachment's content. The Solr content field looks like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> "content":" \n \n \n \n \n \n \n \n \n \n
>>>>>>>>>>>>> --94eb2c1910841bc55f0547f43443\r\nContent-Type: multipart/alternative;
>>>>>>>>>>>>> boundary=94eb2c1910841bc5530547f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>>>> application/pdf; name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>>>>>>>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding: base64\r\nX-Attachment-Id:
>>>>>>>>>>>>> f_iyvt78qa0\r\n\r\nJVBERi0xLjYNJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDAvRSAx\r\n
>>>>>>>>>>>>> NDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2JqDSAgICAgICAgICAgICAgICAg\r\n
>>>>>>>>>>>>> DQp4cmVmDQozNyAzNA0KMDAwMDAwMDAxNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\n
>>>>>>>>>>>>> MDAwMDE1MjIgMDAwM
>>>>>>>>>>>>> ..."
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does the MFC email connector know that the attachment's file
>>>>>>>>>>>>> type is PDF? Does it not extract the contents?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> Cihad Güzel

--
Thanks
Cihad Güzel
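
For readers following the two code changes discussed above, here is a minimal Java sketch. It is not taken from EmailConnector.java: the class name AttachmentIdSketch, the helper names, and the identifier layout folder:messageId[:attachmentNumber] are all assumptions made for illustration. It shows that the removed condition mp.getCount() >= attachmentNumber is true for a part count of 2 and attachment numbers 0 and 1 (the values reported at the top of the thread), and why treating "/" as a separator is fragile when the folder name itself contains "/".

```java
final class AttachmentIdSketch {

    /** Hypothetical parsed identifier; attachmentIndex is null for the email itself. */
    static final class ParsedId {
        final String folder;
        final String messageId;
        final Integer attachmentIndex;

        ParsedId(String folder, String messageId, Integer attachmentIndex) {
            this.folder = folder;
            this.messageId = messageId;
            this.attachmentIndex = attachmentIndex;
        }
    }

    /**
     * Parses the ASSUMED layout folder:messageId[:attachmentNumber].
     * Only ':' is used as a separator, so '/' inside the folder name is harmless.
     * Assumes the message ID itself contains no ':'.
     */
    static ParsedId parse(String di) {
        int index1 = di.indexOf(":");
        if (index1 < 0) {
            throw new IllegalArgumentException("Missing ':' in identifier: " + di);
        }
        String folder = di.substring(0, index1);
        int index2 = di.indexOf(":", index1 + 1);
        if (index2 < 0) {
            // No second separator: this identifier names the email, not an attachment.
            return new ParsedId(folder, di.substring(index1 + 1), null);
        }
        return new ParsedId(folder,
                di.substring(index1 + 1, index2),
                Integer.valueOf(di.substring(index2 + 1)));
    }

    /**
     * Hypothetical fragile variant: treats everything after the first '/' as a number.
     * With a folder such as "Inbox/myFolder/test" this hands "myFolder/test:<...>"
     * to Integer.parseInt, which throws NumberFormatException.
     */
    static int naiveAttachmentNumber(String di) {
        return Integer.parseInt(di.substring(di.indexOf("/") + 1));
    }

    /** The condition quoted in the thread, with mp.getCount() passed in as partCount. */
    static boolean removedCheckFires(int partCount, int attachmentNumber) {
        return partCount >= attachmentNumber;
    }

    /** An ASSUMED alternative: a zero-based attachment number must be below the part count. */
    static boolean isValidAttachment(int partCount, int attachmentNumber) {
        return attachmentNumber >= 0 && attachmentNumber < partCount;
    }

    public static void main(String[] args) {
        ParsedId email = parse("Inbox/myFolder/test:<id@example.com>");
        ParsedId attachment = parse("Inbox/myFolder/test:<id@example.com>:1");
        System.out.println(email.folder + " attachment=" + email.attachmentIndex);
        System.out.println(attachment.folder + " attachment=" + attachment.attachmentIndex);
        System.out.println(removedCheckFires(2, 0) + " " + removedCheckFires(2, 1));
        System.out.println(isValidAttachment(2, 1) + " " + isValidAttachment(2, 2));
    }
}
```

Whether attachmentNumber >= mp.getCount() was the intended check is a guess; the thread only confirms that removing the check and switching the second separator to ":" let indexing succeed.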
