This is a puzzle; the only way this could occur is if some of the records being produced generated absolutely no JSON. Since there is an ID and a type for every record, I can't see how that could happen. So we must somehow be adding records for documents that don't exist? I'll have to look into it.

Karl
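To make the hypothesis above concrete: a minimal sketch, using hypothetical names rather than the connector's actual code, of how comma-joining per-record JSON fragments produces the "}, , {" pattern quoted below whenever a record yields no output, together with the usual guard:

// Hypothetical illustration -- not the ManifoldCF connector's actual code.
import java.util.List;

public class BatchJoiner {

  // Naive join: emits a separator before every non-first fragment, so an
  // empty fragment leaves two commas in a row ("}, , {").
  static String naiveJoin(List<String> fragments) {
    StringBuilder sb = new StringBuilder("[");
    boolean first = true;
    for (String f : fragments) {
      if (!first) sb.append(", ");
      first = false;
      sb.append(f); // appends nothing when f is empty
    }
    return sb.append("]").toString();
  }

  // Guarded join: skip empty fragments, and emit the separator only after
  // a previous fragment has actually been written.
  static String guardedJoin(List<String> fragments) {
    StringBuilder sb = new StringBuilder("[");
    boolean wroteAny = false;
    for (String f : fragments) {
      if (f == null || f.isEmpty()) continue;
      if (wroteAny) sb.append(", ");
      sb.append(f);
      wroteAny = true;
    }
    return sb.append("]").toString();
  }
}

Given the fragments {"{...A...}", "", "{...B...}"}, naiveJoin yields "[{...A...}, , {...B...}]", which is exactly the shape of the log below, while guardedJoin yields "[{...A...}, {...B...}]".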
On Tue, Feb 9, 2016 at 8:49 AM, Juan Pablo Diaz-Vaz <[email protected]> wrote:

> Hi,
>
> The patch worked and now at least the POST has content. Amazon is
> responding with a Parsing Error, though.
>
> I logged the message before it gets posted to Amazon, and it's not valid
> JSON; it had extra commas and parenthesis characters from concatenating
> records. I don't know if this is an issue with my setup or with the
> JSONArrayReader.
>
> [{
> "id": "100D84BAF0BF348EC6EC593E5F5B1F49585DF555",
> "type": "add",
> "fields": {
> <record fields>
> }
> }, , {
> "id": "1E6DC8BA1E42159B14658321FDE0FC2DC467432C",
> "type": "add",
> "fields": {
> <record fields>
> }
> }, , , , , , , , , , , , , , , , {
> "id": "92C7EDAD8398DAC797A7DEA345C1859E6E9897FB",
> "type": "add",
> "fields": {
> <record fields>
> }
> }, , , ]
>
> Thanks,
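A cheap way to catch this before Amazon does, sketched on the assumption that a JSON parser such as Jackson is available on the classpath (this check is not part of the connector): parse the batch locally before POSTing, so a malformed batch like the one above fails fast with a line and column number.

import com.fasterxml.jackson.databind.ObjectMapper;

public class BatchPreflight {
  // Returns normally when the batch parses; throws a parse exception with
  // an exact location when it does not, e.g. on the ", ," sequences above.
  static void requireValidJson(String batch) throws Exception {
    new ObjectMapper().readTree(batch);
  }

  public static void main(String[] args) throws Exception {
    requireValidJson("[{\"id\": \"x\", \"type\": \"add\", \"fields\": {}}]");      // fine
    requireValidJson("[{\"id\": \"x\", \"type\": \"add\", \"fields\": {}}, , ]");  // throws
  }
}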
> On Mon, Feb 8, 2016 at 7:17 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:
>
>> Thanks! I'll apply it and let you know how it goes.
>>
>> On Mon, Feb 8, 2016 at 6:51 PM, Karl Wright <[email protected]> wrote:
>>
>>> Ok, I have a patch. It's actually pretty tiny; the bug is in our code,
>>> not Commons-IO, but Commons-IO changed behavior in a way that exposed it.
>>>
>>> I've created a ticket (CONNECTORS-1271) and attached the patch to it.
>>>
>>> Thanks!
>>> Karl
>>>
>>> On Mon, Feb 8, 2016 at 4:27 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> I have chased this down to a completely broken Apache Commons-IO
>>>> library. It no longer works with the JSONReader objects in ManifoldCF at
>>>> all, and refuses to read anything from them. Unfortunately I can't change
>>>> versions of that library, because other things depend upon it, so I'll
>>>> need to write my own code to replace its functionality. That will take
>>>> some amount of time.
>>>>
>>>> This probably happened the last time our dependencies were updated. My
>>>> apologies.
>>>>
>>>> Karl
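As the messages above note, the real fix turned out to be a small patch to ManifoldCF's own code (CONNECTORS-1271), so no replacement was ultimately needed. Still, for reference, a minimal hand-rolled drain of a java.io.Reader of the kind described, written without Commons-IO (illustrative only):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderDrain {
  // Reads the Reader to exhaustion and returns its contents as a String;
  // read() returns -1 at end of stream.
  static String drain(Reader r) throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[4096];
    int n;
    while ((n = r.read(buf)) != -1) {
      sb.append(buf, 0, n);
    }
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    System.out.println(drain(new StringReader("{\"type\": \"add\"}")));
  }
}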
>>>> On Mon, Feb 8, 2016 at 4:18 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:
>>>>
>>>>> Thanks,
>>>>>
>>>>> Don't know if it'll help, but removing the use of JSONObjectReader in
>>>>> addOrReplaceDocumentWithException and posting to Amazon chunk-by-chunk,
>>>>> instead of using the JSONArrayReader in flushDocuments, changed the
>>>>> error I was getting from Amazon.
>>>>>
>>>>> Maybe those objects are failing when parsing the content to JSON.
>>>>>
>>>>> On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Ok, I'm debugging away, and I can confirm that no data is getting
>>>>>> through. I'll have to open a ticket and create a patch when I find the
>>>>>> problem.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you very much.
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is
>>>>>>>> unhappy about the JSON format we are sending it. The deprecation
>>>>>>>> message is probably a strong clue. I'll experiment here with logging
>>>>>>>> document contents so that I can give you further advice. Stay tuned.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I'm actually not seeing anything on Amazon. The CloudSearch
>>>>>>>>> connector fails when sending the request to Amazon CloudSearch:
>>>>>>>>>
>>>>>>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status":
>>>>>>>>> "error", "errors": [{"message": "[*Deprecated*: Use the outer message
>>>>>>>>> field] Encountered unexpected end of file"}], "adds": 0, "__type":
>>>>>>>>> "#DocumentServiceException", "message": "{ [\"Encountered unexpected
>>>>>>>>> end of file\"] }", "deletes": 0}'
>>>>>>>>>
>>>>>>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>>>>>>
>>>>>>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> If you can possibly include a snippet of the JSON you are seeing
>>>>>>>>>> on the Amazon end, that would be great.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> More likely this is a bug.
>>>>>>>>>>>
>>>>>>>>>>> I take it that it is the body string that is not coming out,
>>>>>>>>>>> correct? Do all the other JSON fields look reasonable? Does the
>>>>>>>>>>> body clause exist and is it just empty, or is it not there at all?
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> When running a copy of the job, but with SOLR as a target, I'm
>>>>>>>>>>>> seeing the expected content being posted to SOLR, so it may not be
>>>>>>>>>>>> an issue with TIKA. After adding some more logging to the
>>>>>>>>>>>> CloudSearch connector, I think the data is getting lost just before
>>>>>>>>>>>> it is passed to the DocumentChunkManager, which inserts the empty
>>>>>>>>>>>> records into the DB. Could it be that the JSONObjectReader doesn't
>>>>>>>>>>>> like my data?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Juan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd try to reproduce as much of the pipeline as possible using
>>>>>>>>>>>>> a Solr output connection. If you include the Tika extractor in the
>>>>>>>>>>>>> pipeline, you will want to configure the Solr connection not to
>>>>>>>>>>>>> use the extracting update handler; there's a checkbox on the
>>>>>>>>>>>>> Schema tab you need to uncheck for that. If you do that, you can
>>>>>>>>>>>>> see pretty exactly what is being sent to Solr; it all gets logged
>>>>>>>>>>>>> in the INFO messages dumped to the Solr log. This should help you
>>>>>>>>>>>>> figure out whether the problem is your Tika configuration or not.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've successfully sent data to FileSystems and SOLR, but for
>>>>>>>>>>>>>> Amazon CloudSearch I'm seeing that only empty messages are being
>>>>>>>>>>>>>> sent to my domain. I think this may be an issue with how I've set
>>>>>>>>>>>>>> up the TIKA Extractor Transformation or the field mapping. I
>>>>>>>>>>>>>> think the database where the records are supposed to be stored
>>>>>>>>>>>>>> before flushing to Amazon is storing empty content.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've tried to find documentation on how to set up the TIKA
>>>>>>>>>>>>>> Transformation, but I haven't been able to find any.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If someone could provide an example of a job set up to send
>>>>>>>>>>>>>> from a FileSystem to CloudSearch, that'd be great!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>>>>>> +56 9 84265890
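Finally, for anyone reproducing the failure outside ManifoldCF: a standalone sketch of POSTing a batch to the CloudSearch 2013-01-01 documents/batch endpoint and printing whatever comes back. The endpoint host below is a placeholder for your domain's document service endpoint, and error handling is deliberately minimal; error documents like the DocumentServiceException quoted above arrive on the connection's error stream.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CloudSearchBatchPost {
  public static void main(String[] args) throws IOException {
    // Placeholder: substitute your own domain's document service endpoint.
    URL url = new URL("https://doc-mydomain-abc123.us-east-1.cloudsearch.amazonaws.com"
        + "/2013-01-01/documents/batch");
    String batch = "[{\"id\": \"doc1\", \"type\": \"add\", \"fields\": {\"title\": \"hello\"}}]";

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(batch.getBytes(StandardCharsets.UTF_8));
    }

    int status = conn.getResponseCode();
    // On 4xx/5xx, the JSON error document (like the one in the log above)
    // comes back on the error stream rather than the input stream.
    InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
    System.out.println("HTTP " + status);
    if (body != null) {
      try (BufferedReader r = new BufferedReader(
          new InputStreamReader(body, StandardCharsets.UTF_8))) {
        for (String line; (line = r.readLine()) != null; ) {
          System.out.println(line);
        }
      }
    }
  }
}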
