Hi, it looks like you may be getting failures when writing to HDFS. For
example, if there's a problem with a batch, it's possible that it will be
partially written to HDFS. In cases like this I would also expect you to
see duplicate entries in your HDFS records. The reason for this is that we
are an at-least-once processing system, which means there may be
duplicates and/or errant records that need to be purged or skipped when
processing. A couple of ways to confirm this are:

   - look for any errors indexed to HDFS
   - look in ES or Solr, e.g. curl -XGET "http://node1:9200/error*/_search";
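Downstream, the usual way to cope with both problems is a small filter that skips any record that fails to parse (the truncated trailing record) and drops exact duplicates (expected under at-least-once delivery). A minimal sketch in Python -- this is a hypothetical helper, not part of Metron:

```python
import json

def clean_records(lines):
    """Parse JSON-lines output from HDFS, skipping truncated records
    and dropping duplicate records after their first occurrence."""
    seen = set()
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            # Partially written record (e.g. an interrupted batch) --
            # skip it rather than fail the whole job.
            continue
        # Canonical form of the record, used to detect duplicates.
        key = json.dumps(record, sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        records.append(record)
    return records
```

If your records carry a unique identifier (e.g. a guid field), deduplicating on that field alone is cheaper than hashing the whole record.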

On Mon, Mar 25, 2019 at 8:25 PM [email protected] <
[email protected]> wrote:

> Hi,
>  When I try to index the data using batchSize=150, the default
> batchTimeout, and a TimedRotationPolicy set to 30 minutes, it creates some
> JSON files in HDFS with incomplete data: the last record in the HDFS file
> contains only a portion of the JSON record. When I try to read the indexed
> data using a Hive external table, it throws an exception due to the
> partial JSON in the file. So while streaming I am not able to do any
> operation on the indexed data.
>
> Indexed file, e.g.:
>     {"key1":"value1","key2":"value2","key3":"value3"}
>     {"key1":"value1","key2":"value2","key3"
>
> When I tried to find the root cause of this behavior, I came across the
> following observations:
>   1. Metron flushes the data to HDFS based on a CountSyncPolicy; by
> default its value is set to the batchSize.
>   2. When Metron performs the file rotation, it first closes the current
> file, which also results in a flush to HDFS.
>   3. Regardless of the batchSize, Metron writes the data to HDFS after the
> batchTimeout.
>   4. The CountSyncPolicy has no relation to the batchTimeout: even if the
> batchTimeout expires and Metron writes the data to HDFS, it won't initiate
> the flush; it still waits for the number of messages to reach the
> CountSyncPolicy count. Is this behavior intentional? Without the sync the
> end user won't be able to access the data completely, so it defeats the
> advantage of the batchTimeout.
>
> Due to the amount of data I am writing, I won't be able to set the
> CountSyncPolicy to 1, as that would impact performance.
>
> Currently our indexing directory structure is as follows: "yyyy/MM/dd". I
> need to do some operations on the newly indexed data based on a sliding
> window; it is currently configured with max_window = 1 and a window size
> of one hour. Every hour I move the window to current_window_hour + 1.
> While it's streaming I hit the JSON format error in Hive.
>
> Can you suggest any methods to overcome this issue?
>
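One way to keep the streaming job from ever seeing a half-written file is to restrict it to files that have not been modified for longer than your rotation interval (30 minutes here), since those should already have been closed and synced by the TimedRotationPolicy. A rough sketch, using the local filesystem for illustration -- against real HDFS paths you would go through an HDFS client (e.g. WebHDFS) instead, and the function name here is hypothetical:

```python
import os
import time

def closed_files(directory, rotation_secs=30 * 60):
    """Return files in `directory` whose last modification is older
    than the rotation interval, i.e. files the rotation policy should
    already have closed. Files still being written are left for the
    next run of the job."""
    cutoff = time.time() - rotation_secs
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) <= cutoff:
            result.append(path)
    return result
```

With the "yyyy/MM/dd" layout, running this over the current day's directory each hour and feeding only the returned files to the Hive/sliding-window step avoids the partial-JSON error without touching the CountSyncPolicy.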