Hi Cihad,

You need to set an attachment URL template for the attachments to be crawled. Open your email connection and click the "URL" tab, and you will see the new field there.

Karl

On Tue, Feb 7, 2017 at 6:07 PM, Cihad Guzel <[email protected]> wrote:

> Hi Karl,
>
> Doesn't the 'else' part have to be processed when the email has an
> attachment? Although the email has an attachment, only the first part
> was processed. Also, I don't see the attachment's content in the Solr
> index.
>
> I edited the code for testing as follows:
>
> if (attachmentIndex == null) {
>   // It's an email
>   System.out.println("running if block");
>   ...
> } else {
>   System.out.println("running else block");
>   // It's an attachment
>   attachmentNumber = attachmentIndex;
>   ...
> }
>
> Then I ran my job. It processed 3 times. The log looks like:
>
> ...
> running if block
> running if block
> running if block
> ...
>
> The Solr response:
>
> {
> "subject":["pdf test page"],
> "from":["Cihad Guzel <[email protected]>"],
> "id":"http://sampleserver/%C4%B0%C5%9F%2FmyFolder%2Ftest?id=%3CCADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw%40mail.gmail.com%3E",
> "date":["Tue Feb 07 20:37:35 MSK 2017"],
> "mimetype":["",
> ""],
> "created_date":"2017-02-07T17:37:35.000Z",
> "indexed_date":"2017-02-07T21:18:05.382Z",
> "to":["Cihad Guzel <[email protected]>"],
> "modified_date":"2017-02-07T17:37:35.000Z",
> "encoding":["",
> ""],
> "mime_type":"text/plain",
> "stream_size":["null"],
> "x_parsed_by":["org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.txt.TXTParser"],
> "stream_content_type":["text/plain"],
> "content_encoding":["windows-1252"],
> "content_type":["text/plain; charset=windows-1252"],
> "content":" \n \n \n \n \n \n \n \n \n \n --
> 94eb2c1910841bc55f0547f43443\r\nContent-Type: multipart/alternative;
> boundary=94eb2c1910841bc5530547f43441\r\n\r\n--
> 94eb2c1910841bc5530547f43441\r\nContent-Type: text/plain;
> charset=UTF-8\r\n\r\nthis is test mail for mfc.\r\n\r\n--
> 94eb2c1910841bc5530547f43441\r\nContent-Type: text/html;
> charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\
> nJVBERi0xLjYNJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9... ",
> "language":"en",
> "_version_":1558710621053124608}]
> }
>
> 2017-02-08 1:17 GMT+03:00 Karl Wright <[email protected]>:
>
>> Here's the full code for this class:
>>
>> https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/email/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/email/EmailConnector.java
>>
>> Karl
>>
>> On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Cihad,
>>>
>>> The variable attachmentIndex is *supposed* to be null except when an
>>> attachment is being processed. The code should look like this:
>>>
>>> if (attachmentIndex == null) {
>>>   // It's an email
>>>   ...
>>> } else {
>>>   // It's an attachment
>>>   attachmentNumber = attachmentIndex;
>>>   ...
>>> }
>>>
>>> Karl
>>>
>>> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <[email protected]> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> I added a LOG line for testing. It looks like attachmentIndex is null.
>>>>
>>>> 2017-02-08 0:11 GMT+03:00 Karl Wright <[email protected]>:
>>>>
>>>>> I attached a second patch (to apply on top of the first patch).
>>>>> Please let me know if that fixes the issue.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <[email protected]> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> I get the following error:
>>>>>>
>>>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed: For input string: "myFolder/test:<CADNgPDgSXHeWo
>>>>>> [email protected]>"
>>>>>> java.lang.NumberFormatException: For input string: "myFolder/test:<cadngpdgsxhewo0gdnul6s2sogusxua9mx2wxot23wi37hog...@mail.gmail.com>"
>>>>>>   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>>   at java.lang.Integer.parseInt(Integer.java:580)
>>>>>>   at java.lang.Integer.parseInt(Integer.java:615)
>>>>>>   at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>>>   at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>
>>>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <[email protected]>:
>>>>>>
>>>>>>> Thanks Karl,
>>>>>>>
>>>>>>> I will try it.
>>>>>>>
>>>>>>> Regards
>>>>>>> Cihad Guzel
>>>>>>>
>>>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <[email protected]>:
>>>>>>>
>>>>>>>> I've created a ticket and attached a patch to it: CONNECTORS-1375.
>>>>>>>> Please let me know if it works for you; if not, I'll fix what
>>>>>>>> doesn't work.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Correction: the only metadata attribute we set is the
>>>>>>>>> attachment(s) mimetype (as a multivalued field) -- this doesn't
>>>>>>>>> currently include the attachment data.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Cihad,
>>>>>>>>>>
>>>>>>>>>> The email connector is providing the attachment data unextracted
>>>>>>>>>> to the output connector as metadata attribute data. There are no
>>>>>>>>>> transformation connectors that look at this metadata. Solr Cell
>>>>>>>>>> also probably does not handle binary in random metadata
>>>>>>>>>> attributes the proper way.
>>>>>>>>>>
>>>>>>>>>> The connector's attachment code therefore seems to be designed
>>>>>>>>>> only to deal with textual attachments. The right solution is to
>>>>>>>>>> have individual IDs for each attachment. But that would also
>>>>>>>>>> require there to be a URL we could construct for each attachment.
>>>>>>>>>> We could provide an additional URI template for attachments, but
>>>>>>>>>> I'd wonder if your system has the ability to serve attachments by
>>>>>>>>>> their own URLs. Please let me know if this would work and if so I
>>>>>>>>>> can create a ticket and work on making these changes.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am trying the email connector with Gmail. I attached the file
>>>>>>>>>>> [1] to a new email and sent it to my test email address.
>>>>>>>>>>>
>>>>>>>>>>> My mail content body is like: "this is test mail for mfc"
>>>>>>>>>>>
>>>>>>>>>>> Then I ran my email job and the email was indexed into Solr
>>>>>>>>>>> successfully. But Solr's content field does not have my
>>>>>>>>>>> attachment's content. The Solr content field looks like:
>>>>>>>>>>>
>>>>>>>>>>> "content":" \n \n \n \n \n \n \n \n \n \n
>>>>>>>>>>> --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>> multipart/alternative; boundary=94eb2c1910841bc553054
>>>>>>>>>>> 7f43441\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail
>>>>>>>>>>> for
>>>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>>>>>>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>>>>>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>>>>> ..."
>>>>>>>>>>>
>>>>>>>>>>> Does the MFC email connector know that the attachment's file
>>>>>>>>>>> type is PDF? Does it not extract the contents?
>>>>>>>>>>>
>>>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>>>> --
>>>>>>>>>>> Regards
>>>>>>>>>>> Cihad Güzel
>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks
>>>>>>> Cihad Güzel
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks
>>>>>> Cihad Güzel
>>>>>>
>>>>
>>>> --
>>>> Thanks
>>>> Cihad Güzel
>>>>
>
> --
> Thanks
> Cihad Güzel
>
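
For anyone reading this thread later: the NumberFormatException quoted above shows Integer.parseInt being handed a whole identifier of the form "myFolder/test:<message-id>". As a rough illustration of how an optional attachment index could be separated from such an identifier before parsing, here is a minimal sketch. It is not the code in EmailConnector.java. The "folder:messageId[:attachmentIndex]" layout, the class DocIdSketch, and the names ParsedId, parse and isSmallNumber are all assumptions made up for this example.

```java
import java.util.Objects;

/** Sketch only: the "folder:messageId[:attachmentIndex]" layout is an assumption. */
class DocIdSketch {

    /** Hypothetical holder for the pieces of a document identifier. */
    static final class ParsedId {
        final String folder;
        final String messageId;
        final Integer attachmentIndex; // null means "this is the email itself"

        ParsedId(String folder, String messageId, Integer attachmentIndex) {
            this.folder = folder;
            this.messageId = messageId;
            this.attachmentIndex = attachmentIndex;
        }
    }

    /** Splits the identifier; only a purely numeric last segment counts as an index. */
    static ParsedId parse(String documentIdentifier) {
        Objects.requireNonNull(documentIdentifier, "documentIdentifier");
        int firstColon = documentIdentifier.indexOf(':');
        if (firstColon < 0) {
            throw new IllegalArgumentException("Missing folder separator: " + documentIdentifier);
        }
        String folder = documentIdentifier.substring(0, firstColon);
        String rest = documentIdentifier.substring(firstColon + 1);

        Integer attachmentIndex = null;
        int lastColon = rest.lastIndexOf(':');
        if (lastColon >= 0) {
            String suffix = rest.substring(lastColon + 1);
            if (isSmallNumber(suffix)) {
                // Safe to parse: the suffix is 1 to 9 ASCII digits.
                attachmentIndex = Integer.valueOf(suffix);
                rest = rest.substring(0, lastColon);
            }
        }
        return new ParsedId(folder, rest, attachmentIndex);
    }

    /** True if s is 1 to 9 ASCII digits, so it always fits in an int. */
    private static boolean isSmallNumber(String s) {
        if (s.isEmpty() || s.length() > 9) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The message id below is a made-up placeholder, not the one from the log.
        ParsedId email = parse("myFolder/test:<abc@mail.example.com>");
        ParsedId attachment = parse("myFolder/test:<abc@mail.example.com>:0");
        System.out.println(email.attachmentIndex);
        System.out.println(attachment.attachmentIndex);
    }
}
```

With a split like this, an identifier without a numeric suffix leaves the index null (the "email" branch), and only a trailing all-digit segment selects the "attachment" branch. A folder name that itself contains a colon would break this sketch.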
