Hi, it looks like you may be hitting failures when writing to HDFS. For example, if there is a problem with a batch, it may be only partially written to HDFS, and when the batch is replayed you will typically see duplicate entries in your HDFS records. This is because Metron is an at-least-once processing system: there may be duplicates and/or errant records that need to be purged or skipped when you process the data (see the dedup sketch just below).
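One way to skip the duplicates at query time is to dedup on the guid field that Metron stamps on each message. A minimal sketch, assuming a Hive external table over the indexed JSON; the table name metron_index and the ordering column are placeholders for your setup:

  hive -e '
    -- keep one row per guid; replayed duplicates get rn > 1
    SELECT t.*
    FROM (
      SELECT m.*,
             row_number() OVER (PARTITION BY guid ORDER BY `timestamp` DESC) AS rn
      FROM metron_index m
    ) t
    WHERE t.rn = 1;
  '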
A couple of ways to confirm this are:
- look for any errors indexed to HDFS
- look in ES or Solr, e.g. curl -XGET "http://node1:9200/error*/_search"

There are also a couple of concrete sketches for the Hive side below your mail.

On Mon, Mar 25, 2019 at 8:25 PM [email protected] <[email protected]> wrote:

> Hi,
> When I try to index the data using batchSize=150, the default
> batchTimeout, and a TimedRotationPolicy set to 30 minutes, it creates
> some JSON files in HDFS with incomplete data: the last record in the HDFS
> file contains only a portion of the JSON record. When I try to read the
> indexed data using a Hive external table, it throws an exception due to
> the partial JSON in the file. So while streaming I am not able to do any
> operation on the indexed data.
>
> Indexed file, e.g.:
> {"key1":"value1","key2":"value2","key3":"value3"}
> {"key1":"value1","key2":"value2","key3"
>
> When I tried to find the root cause of this behavior, I came across the
> following observations:
> 1. Metron flushes the data to HDFS based on a CountSyncPolicy; by
> default its value is set to the batchSize.
> 2. When Metron performs the file rotation, it first closes the current
> file, which also results in a flush to HDFS.
> 3. Regardless of the batchSize, Metron writes the data to HDFS after the
> batchTimeout.
> 4. The CountSyncPolicy has no relation to the batchTimeout: even if the
> batchTimeout expires and Metron writes the data to HDFS, it won't
> initiate the flush; it still waits for the number of messages to reach
> the CountSyncPolicy. Is this behavior intentional? Without the sync, the
> end user can't access the data completely, which defeats the advantage
> of the batchTimeout.
>
> Due to the amount of data I am writing, I can't set the CountSyncPolicy
> to 1, since that would impact performance.
>
> Currently our indexing directory structure is "yyyy/MM/dd". I need to do
> some operations on the newly indexed data based on a sliding window,
> currently configured with max_window = 1 and a window size of one hour.
> Every hour I move the window to current_window_hour + 1. While the data
> is streaming I hit the JSON format error in Hive.
>
> Can you suggest any methods to overcome this issue?
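On the Hive exception: between syncs, the tail record of the currently open file can be truncated, so the reader has to tolerate one malformed final line. If you are reading the JSON with the openx JsonSerDe (org.openx.data.jsonserde.JsonSerDe), it can return NULLs for malformed lines instead of throwing. A sketch, with metron_index again a placeholder table name:

  hive -e '
    ALTER TABLE metron_index
    SET SERDEPROPERTIES ("ignore.malformed.json" = "true");
  '

With that set, the hourly window query can run while the current file is still open, and the truncated record shows up as a row of NULLs you can filter out. (As far as I know, Hive's built-in hcatalog JsonSerDe has no equivalent switch.)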

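To confirm that only the tail record of the open file is affected, check whether the last line of the newest file parses; the path below is a placeholder based on your yyyy/MM/dd layout and the default Metron index directory:

  # concatenate the day's files, grab the last line, try to parse it as JSON
  hdfs dfs -cat '/apps/metron/indexing/indexed/yourSensor/2019/03/25/*.json' \
    | tail -n 1 \
    | python -c 'import json, sys; json.loads(sys.stdin.readline()); print("ok")'

If that fails for the open file but succeeds for rotated files, the sync/rotation gap you describe is the cause, and the practical options are to make the reader tolerant (as above) or to lag the sliding window behind the rotation interval so you only ever query closed files.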