Thierry, I'm not sure I understand what you mean by "is there a way to merge content based on temporal window." Do you want to merge based on a rolling window, or on a timestamp in the data? Can you explain a bit more about how you want to determine which data should go together?
re: QueryRecord, it is not based on Apache Drill. It is based on Apache Calcite. I do believe that Apache Calcite powers Drill's SQL engine as well, but Calcite is just the SQL engine and does not do any sort of schema inference.

At present, you need to provide a schema for the data. If your data is in Avro, you can simply use the schema embedded in the data. If the data is in CSV, you can derive the schema automatically from the header line (and assume that all fields are Strings). Otherwise, you'll probably need to use the Schema Registry.

I have considered implementing some sort of schema-inference processor, but I've not put any priority on it, simply because in my experience schema inference is convenient when it works, but almost always some data will come in that doesn't adhere properly to the inferred schema, and the incorrect inference ends up costing more time than it would have taken to simply create the schema in the first place. Additionally, the schema would have to be inferred for every FlowFile, meaning quite a lot of overhead and inefficiency. That said, I do understand how it would be convenient in some cases; I've personally just not been able to prioritize getting something like that done. Certainly others in the community are welcome to look into it.

Thanks
-Mark

On Jul 20, 2017, at 8:37 AM, Thierry Hanot <[email protected]<mailto:[email protected]>> wrote:

Hello All,

An additional question on this subject: is there a way to merge content based on a temporal window? AttributeRollingWindow does not help here. In my context this would allow me to build an aggregation layer (it's for telemetry data, which arrive at different rates and which I need to normalize/aggregate). The flow might look like this:

- Receive telemetry data.
- Merge content based on the type of data and a temporal window.
- Aggregate the bulk of data using QueryRecord. Normally this should be efficient, as it's done per bulk.
Then stream the result out (backend / MOM …).

Of course, all the aggregation should be dynamic, merging and generating the query based on attributes that qualify the type of the data and the aggregation that needs to be done.

An additional question: if I understand correctly, QueryRecord is based on Drill, and Drill allows the schema to be inferred automatically from a JSON file. Is there a way to use this feature without going through the Schema Registry?

Thanks in advance.
Thierry Hanot

From: James McMahon [mailto:[email protected]]
Sent: 20 July 2017 14:04
To: [email protected]<mailto:[email protected]>
Subject: Re: MergeContent Inquiry

Outstanding. Thank you very much, Joe.

On Thu, Jul 20, 2017 at 8:00 AM, Joe Witt <[email protected]<mailto:[email protected]>> wrote:

Yep, very common. Set the desired size or number-of-objects targets, and set 'Max Bin Age' so that it will kick out whatever you've got by that time.

On Thu, Jul 20, 2017 at 7:38 AM, James McMahon <[email protected]<mailto:[email protected]>> wrote:

> Good morning. I have a situation where I have a staging directory into which
> a small number or a large multitude of files may be dropped. My customer
> wants me to package these up, but within a size range. I see that MergeContent
> allows me to set a MinimumGroupSize and a MaximumGroupSize.
>
> If all the files total less than the MinimumGroupSize in MB, would
> MergeContent take no action until enough files arrived to cross the minimum
> threshold, i.e., would it just sit and wait? Is it possible to combine the
> size thresholds with a time parameter so that if X time passes and no new
> files appear, the package is created despite falling short of the minimum
> size threshold?
>
> Thanks in advance once again for any insights. -Jim
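To make the 'Max Bin Age' behavior described above concrete, here is a minimal simplified sketch in Python. This is not NiFi's actual implementation; the `Bin` class and `offer` function are hypothetical illustrations of the idea that a bin flushes either when it reaches the minimum size or when it ages out, whichever comes first.

```python
class Bin:
    """A single bin of accumulated payloads (hypothetical stand-in for a MergeContent bin)."""
    def __init__(self, now):
        self.items = []      # payloads accumulated so far
        self.size = 0        # total bytes accumulated
        self.created = now   # timestamp when this bin started filling

def offer(b, payload, now, min_size, max_bin_age):
    """Add payload to the bin and return the merged content if the bin should flush.

    The bin flushes when the accumulated size reaches min_size, OR when the
    bin has been open for at least max_bin_age seconds -- mirroring how
    'Max Bin Age' kicks out whatever has accumulated, even below the minimum.
    """
    b.items.append(payload)
    b.size += len(payload)
    if b.size >= min_size or (now - b.created) >= max_bin_age:
        merged = b"".join(b.items)
        b.items, b.size, b.created = [], 0, now  # start a fresh bin
        return merged
    return None
```

With `min_size=10` bytes and `max_bin_age=5.0` seconds, a bin holding only a couple of bytes still flushes once 5 seconds have passed since it was opened, which is the "don't sit and wait forever" behavior asked about in the thread.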
