I'm using the quick start; I'll try a fresh start.

On Tue, Feb 9, 2016 at 11:42 AM, Karl Wright <[email protected]> wrote:

> Hi Juan,
>
> It occurs to me that you may have records in the document chunk table that
> were corrupted by the earlier version of the connector, and that is what is
> being sent. Are you using the quick-start example, or Postgres? If
> Postgres, I'd recommend just deleting all rows in the document chunk table.
>
> Karl
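If the job runs against Postgres, Karl's suggestion amounts to clearing the connector's chunk table between runs. A minimal sketch of that cleanup, assuming direct psql access; the table name below is a placeholder, so confirm the actual chunk table name in your ManifoldCF schema before running it:

    -- Placeholder table name: check your ManifoldCF schema for the real one.
    DELETE FROM cloudsearch_document_chunks;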
On Tue, Feb 9, 2016 at 9:13 AM, Karl Wright <[email protected]> wrote:

> This is a puzzle; the only way this could occur is if some of the records
> being produced generated absolutely no JSON. Since there is an ID and a
> type record for all of them, I can't see how this could happen. So we must
> be adding records for documents that don't exist somehow? I'll have to
> look into it.
>
> Karl

On Tue, Feb 9, 2016 at 8:49 AM, Juan Pablo Diaz-Vaz <[email protected]> wrote:

> Hi,
>
> The patch worked, and now at least the POST has content. Amazon is
> responding with a parsing error, though.
>
> I logged the message before it gets posted to Amazon and it isn't valid
> JSON; it has extra commas and stray characters where the records are
> concatenated. I don't know if this is an issue with my setup or with the
> JSONArrayReader.
>
> [{
>   "id": "100D84BAF0BF348EC6EC593E5F5B1F49585DF555",
>   "type": "add",
>   "fields": {
>     <record fields>
>   }
> }, , {
>   "id": "1E6DC8BA1E42159B14658321FDE0FC2DC467432C",
>   "type": "add",
>   "fields": {
>     <record fields>
>   }
> }, , , , , , , , , , , , , , , , {
>   "id": "92C7EDAD8398DAC797A7DEA345C1859E6E9897FB",
>   "type": "add",
>   "fields": {
>     <record fields>
>   }
> }, , , ]
>
> Thanks,
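The ", ," pattern in that snippet is what you get when per-document JSON chunks are concatenated with one separator per record and some of the chunks are empty. The sketch below is not the connector's code; it just reproduces the symptom and shows the obvious guard (skip empty chunks) that would keep the array valid. As Karl notes later in the thread, the real question is why empty chunks are being produced at all.

    import java.util.ArrayList;
    import java.util.List;

    public class BatchJoinCheck {
      public static void main(String[] args) {
        // Simulated per-document JSON chunks; one of them is empty, as seems
        // to be happening for some documents in the batch above.
        List<String> chunks = new ArrayList<>();
        chunks.add("{\"id\": \"A\", \"type\": \"add\", \"fields\": {}}");
        chunks.add("");  // a document that produced no JSON at all
        chunks.add("{\"id\": \"B\", \"type\": \"add\", \"fields\": {}}");

        // Naive join: every chunk contributes a separator, so an empty chunk
        // shows up as ", ," -- exactly the artifact in the logged batch.
        System.out.println("[" + String.join(", ", chunks) + "]");

        // Guarded join: skip empty chunks so the array stays valid JSON.
        StringBuilder batch = new StringBuilder("[");
        boolean first = true;
        for (String chunk : chunks) {
          if (chunk == null || chunk.isEmpty())
            continue;
          if (!first)
            batch.append(", ");
          batch.append(chunk);
          first = false;
        }
        batch.append("]");
        System.out.println(batch);
      }
    }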
On Mon, Feb 8, 2016 at 7:17 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:

> Thanks! I'll apply it and let you know how it goes.

On Mon, Feb 8, 2016 at 6:51 PM, Karl Wright <[email protected]> wrote:

> Ok, I have a patch. It's actually pretty tiny; the bug is in our code, not
> in Commons-IO, but a change in Commons-IO exposed it.
>
> I've created a ticket (CONNECTORS-1271) and attached the patch to it.
>
> Thanks!
> Karl

On Mon, Feb 8, 2016 at 4:27 PM, Karl Wright <[email protected]> wrote:

> I have chased this down to a completely broken Apache Commons-IO library.
> It no longer works with the JSONReader objects in ManifoldCF at all, and
> refuses to read anything from them. Unfortunately I can't change versions
> of that library because other things depend upon it, so I'll need to write
> my own code to replace its functionality. That will take some amount of
> time to do.
>
> This probably happened the last time our dependencies were updated. My
> apologies.
>
> Karl
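A quick way to check whether a given Reader yields any data through the Commons-IO copy loop is a small standalone harness like the one below. It is only a sketch: the StringReader is a stand-in, and you would swap in the connector's JSONObjectReader or JSONArrayReader to reproduce the "nothing gets through" behaviour Karl describes.

    import java.io.Reader;
    import java.io.StringReader;
    import java.io.StringWriter;

    import org.apache.commons.io.IOUtils;

    public class ReaderCopyCheck {
      public static void main(String[] args) throws Exception {
        // Stand-in for the reader under test; swap in the connector's
        // JSON reader classes to reproduce the reported behaviour.
        Reader reader = new StringReader("{\"id\": \"A\", \"type\": \"add\"}");

        StringWriter out = new StringWriter();
        int copied = IOUtils.copy(reader, out);  // the Commons-IO copy loop

        // With the broken interaction nothing comes through; with the
        // CONNECTORS-1271 patch applied, the full JSON should be copied.
        System.out.println("chars copied: " + copied);
        System.out.println(out);
      }
    }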
On Mon, Feb 8, 2016 at 4:18 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:

> Thanks,
>
> I don't know if it'll help, but removing the usage of JSONObjectReader in
> addOrReplaceDocumentWithException and posting to Amazon chunk by chunk,
> instead of using the JSONArrayReader in flushDocuments, changed the error
> I was getting from Amazon.
>
> Maybe those objects are failing to parse the content into JSON.

On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <[email protected]> wrote:

> Ok, I'm debugging away, and I can confirm that no data is getting through.
> I'll have to open a ticket and create a patch when I find the problem.
>
> Karl

On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:

> Thank you very much.

On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <[email protected]> wrote:

> Ok, thanks, this is helpful -- it clearly sounds like Amazon is unhappy
> about the JSON format we are sending it. The deprecation message is
> probably a strong clue. I'll experiment here with logging document
> contents so that I can give you further advice. Stay tuned.
>
> Karl

On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:

> I'm actually not seeing anything on Amazon. The CloudSearch connector
> fails when sending the request to Amazon CloudSearch:
>
> AmazonCloudSearch: Error sending document chunk 0: '{"status": "error",
> "errors": [{"message": "[*Deprecated*: Use the outer message field]
> Encountered unexpected end of file"}], "adds": 0, "__type":
> "#DocumentServiceException", "message": "{ [\"Encountered unexpected end
> of file\"] }", "deletes": 0}'
>
> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -

On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <[email protected]> wrote:

> If you can possibly include a snippet of the JSON you are seeing on the
> Amazon end, that would be great.
>
> Karl

On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <[email protected]> wrote:

> More likely this is a bug.
>
> I take it that it is the body string that is not coming out, correct? Do
> all the other JSON fields look reasonable? Does the body clause exist but
> come out empty, or is it not there at all?
>
> Karl

On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:

> Hi,
>
> When running a copy of the job, but with Solr as a target, I'm seeing the
> expected content being posted to Solr, so it may not be an issue with
> Tika. After adding some more logging to the CloudSearch connector, I think
> the data is getting lost just before it is passed to the
> DocumentChunkManager, which inserts the empty records into the DB. Could
> it be that the JSONObjectReader doesn't like my data?
>
> Thanks,

On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <[email protected]> wrote:

> Hi Juan,
>
> I'd try to reproduce as much of the pipeline as possible using a Solr
> output connection. If you include the Tika extractor in the pipeline, you
> will want to configure the Solr connection to not use the extracting
> update handler. There's a checkbox on the Schema tab you need to uncheck
> for that. If you do that, you can see almost exactly what is being sent to
> Solr; it all gets logged in the INFO messages written to the Solr log.
> This should help you figure out whether the problem is your Tika
> configuration or not.
>
> Please give this a try and let me know what happens.
>
> Karl

On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <[email protected]> wrote:

> Hi,
>
> I've successfully sent data to file systems and Solr, but for Amazon
> CloudSearch I'm seeing that only empty messages are being sent to my
> domain. I think this may be an issue with how I've set up the Tika
> Extractor transformation or the field mapping. I think the database where
> the records are supposed to be stored before flushing to Amazon is storing
> empty content.
>
> I've tried to find documentation on how to set up the Tika transformation,
> but I haven't been able to find any.
>
> If someone could provide an example of a job set up to send from a File
> System to CloudSearch, that'd be great!
>
> Thanks in advance,

--
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890
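For completeness, the failing step reported earlier in the thread ("Error sending document chunk 0") boils down to an HTTP POST of a JSON array against the domain's document-service endpoint. Below is a minimal standalone sketch of such a post, handy for testing a hand-built batch outside ManifoldCF; the endpoint host and the field names are placeholders for your own CloudSearch domain, not values from this thread.

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.entity.ContentType;
    import org.apache.http.entity.StringEntity;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class CloudSearchPostCheck {
      public static void main(String[] args) throws Exception {
        // Placeholder: use your own domain's document service endpoint.
        String endpoint =
            "https://doc-DOMAIN-ID.REGION.cloudsearch.amazonaws.com/2013-01-01/documents/batch";

        // A minimal, well-formed batch: a JSON array with no empty elements,
        // each element carrying "type", "id", and "fields".
        String batch =
            "[{\"type\":\"add\",\"id\":\"doc1\",\"fields\":{\"content\":\"hello\"}}]";

        try (CloseableHttpClient client = HttpClients.createDefault()) {
          HttpPost post = new HttpPost(endpoint);
          post.setEntity(new StringEntity(batch, ContentType.APPLICATION_JSON));
          try (CloseableHttpResponse response = client.execute(post)) {
            // CloudSearch answers with a JSON status body, like the one
            // quoted in the error message above.
            System.out.println(response.getStatusLine());
            System.out.println(EntityUtils.toString(response.getEntity()));
          }
        }
      }
    }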
