Thierry, I'm not sure I understand what you mean by "is there a way to merge content based on temporal window." Do you want to merge based on a rolling window, or on a timestamp in the data? Can you explain a bit more about how you want to determine which data should go together?
re: QueryRecord, it is not based on Apache Drill. It is based on Apache Calcite. I do believe that Apache Calcite powers Drill's SQL engine as well, but Calcite is just the SQL engine and does not do any sort of schema inference.

At present, you need to provide a schema for the data. If your data is in Avro, you can simply use the schema embedded in the data. If the data is in CSV, you can derive the schema automatically from the header line (and assume that all fields are Strings). Otherwise, you'll probably need to use the Schema Registry.

I have considered implementing some sort of schema-inference processor, but I've not put any priority on it, simply because in my experience schema inference is convenient when it works, but almost always some data will come in that doesn't adhere properly to the inferred schema, and the incorrect inference ends up costing more time than it would have taken to simply create the schema in the first place. Additionally, the schema would have to be inferred for every FlowFile, meaning quite a lot of overhead and inefficiency. That said, I do understand how it would be convenient in some cases; I've personally just not been able to prioritize getting something like that done. Certainly others in the community are welcome to look into it.

Thanks
-Mark

On Jul 20, 2017, at 8:37 AM, Thierry Hanot <[email protected]<mailto:[email protected]>> wrote:

Hello All,

An additional question on this subject: is there a way to merge content based on a temporal window? AttributeRollingWindow does not help here. In my context this would allow me to build an aggregation layer (it's for telemetry data, which arrive at different rates and which I need to normalize/aggregate). The flow might look like this:

- Receive telemetry data.
- Merge content based on the type of data and a temporal window.
- Aggregate the bulk of data using QueryRecord. Normally this should be efficient, as it's done per bulk.
Then stream the result out (backend / MOM …).

Of course, all the aggregation should be dynamic, merging and generating the query based on attributes that qualify the type of the data and the aggregation that needs to be done.

An additional question: if I understand correctly, QueryRecord is based on Drill, and Drill allows the schema to be inferred automatically from a JSON file. Is there a way to use this feature without going through the Schema Registry?

Thanks in advance.
Thierry Hanot

From: James McMahon [mailto:[email protected]]
Sent: 20 July 2017 14:04
To: [email protected]<mailto:[email protected]>
Subject: Re: MergeContent Inquiry

Outstanding. Thank you very much, Joe.

On Thu, Jul 20, 2017 at 8:00 AM, Joe Witt <[email protected]<mailto:[email protected]>> wrote:

Yep, very common. Set the desired size or number-of-objects targets, and set 'Max Bin Age' so that it will kick out whatever you've got by that time.

On Thu, Jul 20, 2017 at 7:38 AM, James McMahon <[email protected]<mailto:[email protected]>> wrote:

> Good morning. I have a situation where I have a staging directory into which
> a small number or a large multitude of files may be dropped. My customer
> wants me to package these up, but within a size range. I see that MergeContent
> allows me to set a MinimumGroupSize and a MaximumGroupSize.
>
> If all the files total less than the MinimumGroupSize in MB, would
> MergeContent take no action until enough files arrived to cross the minimum
> threshold, i.e., would it just sit and wait? Is it possible to combine the
> size thresholds with a time parameter so that if X time passes and no new
> files appear, the package is created despite falling short of the minimum
> size threshold?
>
> Thanks in advance once again for any insights. -Jim
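To make the 'Max Bin Age' behavior described above concrete, here is a minimal simplified sketch in Python. This is not NiFi's actual implementation; the `Bin` class and `offer` function are hypothetical illustrations of the idea that a bin flushes either when it reaches the minimum size or when it ages out, whichever comes first.

```python
class Bin:
    """A single bin of accumulated payloads (hypothetical stand-in for a MergeContent bin)."""
    def __init__(self, now):
        self.items = []      # payloads accumulated so far
        self.size = 0        # total bytes accumulated
        self.created = now   # timestamp when this bin started filling

def offer(b, payload, now, min_size, max_bin_age):
    """Add payload to the bin and return the merged content if the bin should flush.

    The bin flushes when the accumulated size reaches min_size, OR when the
    bin has been open for at least max_bin_age seconds -- mirroring how
    'Max Bin Age' kicks out whatever has accumulated, even below the minimum.
    """
    b.items.append(payload)
    b.size += len(payload)
    if b.size >= min_size or (now - b.created) >= max_bin_age:
        merged = b"".join(b.items)
        b.items, b.size, b.created = [], 0, now  # start a fresh bin
        return merged
    return None
```

With `min_size=10` bytes and `max_bin_age=5.0` seconds, a bin holding only a couple of bytes still flushes once 5 seconds have passed since it was opened, which is the "don't sit and wait forever" behavior asked about in the thread.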
