Ok, I have a patch. It's actually pretty tiny; the bug is in our code, not in Commons-IO, but a behavior change in Commons-IO exposed it.

I've created a ticket (CONNECTORS-1271) and attached the patch to it. Thanks!

Karl
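A minimal sketch of the class of bug in play -- not the actual CONNECTORS-1271 patch, which is attached to the ticket, and the BrokenJsonReader below is purely hypothetical. It only illustrates how a Reader that mishandles the read(char[], int, int) end-of-stream contract can hand a looping consumer, such as Commons-IO's copy routines, a truncated or apparently empty stream:

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class EofContractDemo {

  // Hypothetical stand-in for a streaming JSON reader; NOT ManifoldCF code.
  static class BrokenJsonReader extends Reader {
    private final Reader delegate = new StringReader("{\"body\":\"hello\"}");
    private boolean firstCall = true;

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
      // Bug (for illustration): reports end-of-stream after the first
      // call, even though the delegate still has data, so a consumer
      // that reads in a loop sees only a fragment of the JSON.
      if (!firstCall)
        return -1;
      firstCall = false;
      return delegate.read(buf, off, Math.min(len, 4));
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }

  public static void main(String[] args) throws IOException {
    StringBuilder out = new StringBuilder();
    char[] buf = new char[1024];
    int n;
    try (Reader r = new BrokenJsonReader()) {
      while ((n = r.read(buf, 0, buf.length)) != -1)
        out.append(buf, 0, n);
    }
    // Prints only '{"bo' -- downstream, CloudSearch would see truncated
    // JSON and report "Encountered unexpected end of file".
    System.out.println(out);
  }
}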
On Mon, Feb 8, 2016 at 4:27 PM, Karl Wright <[email protected]> wrote:

> I have chased this down to a completely broken Apache Commons-IO library. It no longer works with the JSONReader objects in ManifoldCF at all, and refuses to read anything from them. Unfortunately, I can't change versions of that library because other things depend upon it, so I'll need to write my own code to replace its functionality. That will take some amount of time to do.
> This probably happened the last time our dependencies were updated. My apologies.
> Karl
> On Mon, Feb 8, 2016 at 4:18 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:
>> Thanks,
>> Don't know if it'll help, but removing the usage of JSONObjectReader in addOrReplaceDocumentWithException, and posting to Amazon chunk by chunk instead of using the JSONArrayReader in flushDocuments, changed the error I was getting from Amazon.
>> Maybe those objects are failing to parse the content as JSON.
>> On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <[email protected]> wrote:
>>> Ok, I'm debugging away, and I can confirm that no data is getting through. I'll have to open a ticket and create a patch when I find the problem.
>>> Karl
>>> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:
>>>> Thank you very much.
>>>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <[email protected]> wrote:
>>>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is unhappy about the JSON format we are sending it. The deprecation message is probably a strong clue. I'll experiment here with logging document contents so that I can give you further advice. Stay tuned.
>>>>> Karl
>>>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:
>>>>>> I'm actually not seeing anything on Amazon. The CloudSearch connector fails when sending the request to Amazon CloudSearch:
>>>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status": "error", "errors": [{"message": "[*Deprecated*: Use the outer message field] Encountered unexpected end of file"}], "adds": 0, "__type": "#DocumentServiceException", "message": "{ [\"Encountered unexpected end of file\"] }", "deletes": 0}'
>>>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <[email protected]> wrote:
>>>>>>> If you can possibly include a snippet of the JSON you are seeing on the Amazon end, that would be great.
>>>>>>> Karl
>>>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <[email protected]> wrote:
>>>>>>>> More likely, this is a bug.
>>>>>>>> I take it that it is the body string that is not coming out, correct? Do all the other JSON fields look reasonable? Does the body clause exist and is just empty, or is it not there at all?
>>>>>>>> Karl
>>>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>> When running a copy of the job, but with Solr as a target, I'm seeing the expected content being posted to Solr, so it may not be an issue with Tika.
>>>>>>>>> After adding some more logging to the CloudSearch connector, I think the data is getting lost just before it is passed to the DocumentChunkManager, which inserts the empty records into the DB. Could it be that the JSONObjectReader doesn't like my data?
>>>>>>>>> Thanks,
>>>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>> Hi Juan,
>>>>>>>>>> I'd try to reproduce as much of the pipeline as possible using a Solr output connection. If you include the Tika extractor in the pipeline, you will want to configure the Solr connection to not use the extracting update handler; there's a checkbox on the Schema tab you need to uncheck for that. Once you do that, you can see pretty exactly what is being sent to Solr; it all gets logged in the INFO messages dumped to the Solr log. This should help you figure out whether the problem is your Tika configuration or not.
>>>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>>> Karl
>>>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> I've successfully sent data to the File System and Solr outputs, but for Amazon CloudSearch I'm seeing that only empty messages are being sent to my domain. I think this may be an issue with how I've set up the Tika Extractor transformation or the field mapping. I think the database where the records are supposed to be stored before being flushed to Amazon is storing empty content.
>>>>>>>>>>> I've tried to find documentation on how to set up the Tika transformation, but I haven't been able to find any.
>>>>>>>>>>> If someone could provide an example of a job set up to send from a file system to CloudSearch, that'd be great!
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>> --
>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>> +56 9 84265890
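A sketch of the workaround described earlier in the thread: posting the document batch to Amazon as one buffered request instead of streaming it through the JSON reader classes. The /2013-01-01/documents/batch path and the add-operation format follow the CloudSearch document service; the endpoint host, document id, and fields below are placeholders, and a real connector would also need error handling and whatever authentication the domain's access policy requires.

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class BufferedBatchPost {
  public static void main(String[] args) throws IOException {
    // Placeholder document-service endpoint; substitute your domain's.
    URL url = new URL("https://doc-example.us-east-1.cloudsearch.amazonaws.com"
        + "/2013-01-01/documents/batch");

    // Building the batch as a plain String first makes it trivial to log
    // exactly what is sent, which is how the empty-content symptom was
    // narrowed down in this thread.
    String batch = "[{\"type\":\"add\",\"id\":\"doc1\","
        + "\"fields\":{\"body\":\"hello world\"}}]";
    byte[] payload = batch.getBytes(StandardCharsets.UTF_8);

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setFixedLengthStreamingMode(payload.length);
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(payload);
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}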
