Re: Oversized queue between process groups

2019-09-17 Thread Jeremy Pemberton-Pigott
I checked the logs that I could find, but there was nothing useful; most of
them were gone because the drive had filled up with earlier logs of it
generating errors about not being able to write or update any files. The
rest of the logs were removed as part of the recovery process. What I do
notice is that if it receives a large number of files from a list processor,
for example, say 30 files, it isn't swapping the files back in after
swapping them out of the queue. The queue will show files to be processed,
and the next processor in the flow will show a very high task count, in the
millions, but won't process anything. I experience the same problem with
1.9.2, and I'm trying to build a reproducible flow to help clarify the
problem. For this issue between groups it could be a similar problem, but
I've not been able to reproduce it yet as I'm a little busy at the moment.
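
For reference, the swap behavior Mark describes below is governed by a few
nifi.properties entries. A minimal sketch with what I believe are the stock
1.x defaults (worth verifying against your own install):

    # queue size at which NiFi starts swapping FlowFiles to disk
    nifi.queue.swap.threshold=20000
    # threads used to swap FlowFiles back in / out
    nifi.swap.in.threads=1
    nifi.swap.out.threads=4

If the swap files keep growing on disk while the queue count climbs into
the millions, that matches the NIFI-5997 symptom described below.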

Regards,

Jeremy

On Sat, Aug 31, 2019 at 10:58 PM Mark Payne  wrote:

> Jeremy,
>
> Thanks for the details & history - I am indeed interested :) Do you see
> any errors in the logs? Particularly around a failure to update the
> FlowFile Repository? I am now thinking that you may be running into
> NIFI-5997 [1]. This appears to have affected at least all 1.x versions
> prior to 1.9. When a queue reaches a certain size (20,000 FlowFiles by
> default), NiFi will swap out the FlowFiles to disk to avoid running out of
> memory. It does this in batches of 10,000 FlowFiles at a time. Before
> NIFI-5997 was addressed, if there was a problem updating the FlowFile
> Repository, the data would be written to a swap file while also staying in
> the queue. And the next time a FlowFile came in, this would happen again.
> So you'd quickly see the queue become huge and a lot of data written to
> swap files, with many duplicates of the data.
>
> Not 100% sure that this is what you're hitting, but it's my best hunch at
> the moment. Please do see if you have any errors logged around updating the
> FlowFile Repository.
>
> Thanks
> -Mark
>
>
> [1] https://issues.apache.org/jira/browse/NIFI-5997
>
>
>
> On Aug 30, 2019, at 11:20 PM, Jeremy Pemberton-Pigott <
> fuzzych...@gmail.com> wrote:
>
> Thanks for your reply Mark.
>
> The flow was in sync between nodes; no one edited it, as it was started
> from a Docker image and left running. It had been running for about a
> month. A restart didn't clear the queue. Only one node had an issue; the
> others were clear. The FlowFile repository swap directory was about 630 GB
> in size on the full node. It's running on CentOS 7.5.
>
> Below is just a bit of history if you're interested; otherwise skip it.
>
> The cluster is running CentOS 7.5 on those 3 nodes. NiFi was configured
> with 4 GB of heap in the bootstrap. It runs on a partition with 1 TB of
> free space (16-thread, 64 GB RAM nodes). It had been running for almost a
> month before something happened, and then a backlog built up for about a
> week before someone noticed something was up. The partition was totally
> full on one node, but NiFi was still running (processing nothing, of
> course, on the full node); the second node was running, and the third had
> lost its network connection on one card, I think, precipitating the
> problem, so that node was not connected to the cluster.
>
> I could see the queue was about 210 million in the UI before I shut NiFi
> down to fix things. I cleared out the log folder of the full node (around
> 200 GB of NiFi app logs; for some reason it isn't rolling them correctly in
> this case, though the other nodes are fine) and restarted, but the
> large-queue node was giving OOM errors on Jetty startup, so I increased the
> heap to 24 GB on all nodes to get things started. It could then run, and
> the queue count showed correctly (I have encountered queues clearing on
> restart before with small queues).
>
> It began processing the queue, so I left it for 2 days to recover while
> clearing out the log folder periodically to keep some drive space available
> (it was generating about 40 GB of logs every few hours); the FlowFile
> repository swap folder started off at about 640 GB (normally it's just a
> few MB when it's running). But I noticed that the node would stop
> processing after a short period of time, with an UpdateAttribute showing a
> partially full queue of 4,000 going into a funnel and the whole flow
> hanging with zero in/out everywhere I checked. Each time I restarted NiFi
> those small queues would clear, but the same thing would happen.
>
> The large queue is not critical this time, so I started clearing it from
> the NCM; it's going at a rate of about 75k FlowFiles per minute, so I'll
> leave it running over the weekend to see how far it gets while everything
> else keeps running to clear the other parallel queues on that node.
>
> Other than the one node having a large queue, it is still running, and the
> other nodes are working fine now. No new data is streaming in until
> Tuesday, so I hope to clear the backlog on the one 

Re: NIFI - " character in CSV Reader

2019-09-17 Thread Eric Ladner
Do you need a controller service for that? I thought just a GetFile pointed
at a directory would suffice.

On Tue, Sep 17, 2019 at 5:54 PM KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Thanks. May I know the controller service for this? I don't see that
> option in the CSV Reader.
>
> On Tue, Sep 17, 2019 at 5:51 PM Eric Ladner  wrote:
>
>> Try changing the CSV Format to Excel and put the escape character back to
>> \ (backslash).
>>
>> I believe the CSV format definition defines the handling of double quotes
>> separately from an escape character.  I know Excel handles double-double
>> quotes by default, tho.
>>
>> Fyi - """Teahous"" Beijing People's Art" after the removal of
>> surrounding quotes and special handling of double-double quotes should end
>> up as "Teahous" Beijing People's Art
>>
>> On Tue, Sep 17, 2019 at 5:35 PM KhajaAsmath Mohammed <
>> mdkhajaasm...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am using the CSVReader to read the data from the csv files.
>>>
>>> [image: image.png]
>>>
>>> Here is the sample data I have.
>>>
>>> "A"|"""Teahous"" Beijing People's Art"|"""Teahous"""|"
>>>
>>> With the above CSV reader, the " characters are completely removed. My
>>> expected output should be ""Teahous"" Beijing People's Art for the value
>>> """Teahous"" Beijing People's Art".
>>>
>>> Any suggestions, please?
>>>
>>> Thanks,
>>> Asmath
>>>
>>
>>
>> --
>> Eric Ladner
>>
>

-- 
Eric Ladner


Re: NIFI - " character in CSV Reader

2019-09-17 Thread KhajaAsmath Mohammed
Thanks. May I know the controller service for this? I don't see that option
in the CSV Reader.

On Tue, Sep 17, 2019 at 5:51 PM Eric Ladner  wrote:

> Try changing the CSV Format to Excel and put the escape character back to
> \ (backslash).
>
> I believe the CSV format definition defines the handling of double quotes
> separately from an escape character.  I know Excel handles double-double
> quotes by default, tho.
>
> Fyi - """Teahous"" Beijing People's Art" after the removal of surrounding
> quotes and special handling of double-double quotes should end up as "Teahous"
> Beijing People's Art
>
> On Tue, Sep 17, 2019 at 5:35 PM KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> I am using the CSVReader to read the data from the csv files.
>>
>> [image: image.png]
>>
>> Here is the sample data I have.
>>
>> "A"|"""Teahous"" Beijing People's Art"|"""Teahous"""|"
>>
>> With the above CSV reader, the " characters are completely removed. My
>> expected output should be ""Teahous"" Beijing People's Art for the value
>> """Teahous"" Beijing People's Art".
>>
>> Any suggestions, please?
>>
>> Thanks,
>> Asmath
>>
>
>
> --
> Eric Ladner
>


Re: NIFI - " character in CSV Reader

2019-09-17 Thread Eric Ladner
Try changing the CSV Format to Excel and put the escape character back to \
(backslash).

I believe the CSV format definition defines the handling of double quotes
separately from an escape character.  I know Excel handles double-double
quotes by default, tho.

Fyi - """Teahous"" Beijing People's Art" after the removal of surrounding
quotes and special handling of double-double quotes should end up as "Teahous"
Beijing People's Art
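
To make that concrete, the CSVReader controller service settings I have in
mind would look something like this (property names from memory, so
double-check them against your NiFi version):

    CSV Format       = Microsoft Excel
    Value Separator  = |
    Quote Character  = "
    Escape Character = \

With those settings, a record like this one (the stray trailing |" from
your sample dropped for clarity):

    "A"|"""Teahous"" Beijing People's Art"|"""Teahous"""

should parse into the three fields A, "Teahous" Beijing People's Art, and
"Teahous", since a doubled quote inside a quoted field collapses to a
single literal quote.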

On Tue, Sep 17, 2019 at 5:35 PM KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I am using the CSVReader to read the data from the csv files.
>
> [image: image.png]
>
> Here is the sample data I have.
>
> "A"|"""Teahous"" Beijing People's Art"|"""Teahous"""|"
>
> With the above CSV reader, the " characters are completely removed. My
> expected output should be ""Teahous"" Beijing People's Art for the value
> """Teahous"" Beijing People's Art".
>
> Any suggestions, please?
>
> Thanks,
> Asmath
>


-- 
Eric Ladner


NIFI - " character in CSV Reader

2019-09-17 Thread KhajaAsmath Mohammed
Hi,

I am using the CSVReader to read the data from the csv files.

[image: image.png]

Here is the sample data I have.

"A"|"""Teahous"" Beijing People's Art"|"""Teahous"""|"

With the above CSV reader, the " characters are completely removed. My
expected output should be ""Teahous"" Beijing People's Art for the value
"""Teahous"" Beijing People's Art".

Any suggestions, please?

Thanks,
Asmath


Re: Stateful Dataflow Moved to New Cluster

2019-09-17 Thread Noe Detore
this is great!

Thank you
Noe

On Tue, Sep 17, 2019 at 11:37 AM Joe Witt  wrote:

> Quick reply: there is a ZooKeeper state migrator utility in the toolkit, I
> believe. That should be quite helpful.
>
>
> http://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html#zookeeper_migrator
>
> Thanks
>
> On Tue, Sep 17, 2019 at 11:35 AM Noe Detore 
> wrote:
>
>> Hello,
>>
>> I am currently using a stateful processor such as GetSplunk in an active
>> data flow. I want to move this data flow to a new Nifi cluster and preserve
>> the state of the processor. How can this be done?
>>
>> Thank you
>> Noe
>>
>


Re: Stateful Dataflow Moved to New Cluster

2019-09-17 Thread Joe Witt
Quick reply: there is a ZooKeeper state migrator utility in the toolkit, I
believe. That should be quite helpful.

http://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html#zookeeper_migrator
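
To sketch the round trip (connect strings and paths below are placeholders;
see the toolkit guide above for the exact options):

    # read component state out of the source cluster's ZooKeeper
    ./bin/zk-migrator.sh -r -z source-zk:2181/nifi/components \
        -f /tmp/zk-source-data.json

    # send it to the destination cluster's ZooKeeper
    ./bin/zk-migrator.sh -s -z dest-zk:2181/nifi/components \
        -f /tmp/zk-source-data.json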

Thanks

On Tue, Sep 17, 2019 at 11:35 AM Noe Detore  wrote:

> Hello,
>
> I am currently using a stateful processor such as GetSplunk in an active
> data flow. I want to move this data flow to a new Nifi cluster and preserve
> the state of the processor. How can this be done?
>
> Thank you
> Noe
>


Stateful Dataflow Moved to New Cluster

2019-09-17 Thread Noe Detore
Hello,

I am currently using a stateful processor such as GetSplunk in an active
data flow. I want to move this data flow to a new Nifi cluster and preserve
the state of the processor. How can this be done?

Thank you
Noe


Seeking Feedback - New ListS3 Processor Contribution??

2019-09-17 Thread Aram Openden
Hi.

The team I work in is doing a good deal of work with the NiFi S3 processors,
among others, and writing some of our own custom processors. Our team has a
use-case requirement for a variation on the ListS3 Processor similar to the
one Martijn Dekkers described in this post:
http://apache-nifi-users-list.2361937.n4.nabble.com/Listing-S3-tp5777p5850.html

For context, the reader may wish to refer to this entire thread from the
beginning:
http://apache-nifi-users-list.2361937.n4.nabble.com/Listing-S3-td5777.html#a5850

In our case, we would like the processor to accept incoming FlowFiles and
be able to change the S3 bucket it "listens to" by making the s3.bucket
attribute settable with the NiFi Expression Language, while continuing to
maintain the internal state of the Processor. We would simultaneously
restrict the prefix property from being updated, making it a fixed value
for the entire lifetime of the Processor's run. In other words, we want a
WatchMultipleS3Buckets Processor that maintains state for multiple buckets.

In practice this means that, for our new processor, we would modify the
internal state management logic. Currently, the value for each entry in the
StateMap is simply the "filename" of the S3 object. Our suggested change
would have this value instead be bucketName + some delimiter + filename of
the S3 object.
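
For illustration only (the actual state key names ListS3 uses internally
may differ), the entries would go from something like

    key-0 = incoming/data-001.json
    key-1 = incoming/data-002.json

to something like

    key-0 = my-bucket|incoming/data-001.json
    key-1 = my-bucket|incoming/data-002.json

where "my-bucket" and the "|" delimiter are placeholder choices.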

Our team is working on our variation of the processor, this
WatchMultipleS3Buckets. We would like to contribute this effort back to the
community as follows: since there will be a great deal of common code
between the current ListS3 Processor and our newly proposed
WatchMultipleS3Buckets Processor, we would refactor to create a new
abstract class, AbstractS3WatchProcessor, with the existing ListS3 and the
new WatchMultipleS3Buckets processors as subclasses of it.

Is this addition/modification something the community would be interested
in? If yes, can someone please provide a link to the logistics and
contribution guidelines we should follow in order to contribute this
change?

Thanks.

--Aram

-
https://github.com/aramcodz


Re: Use Array Type On Avro Schema

2019-09-17 Thread Bryan Bende
What is the error you are getting?

It looks like you don't have a type specified for any of the array fields.

Example:

{"type": "array", "items": "string"}

Also, all the arrays are empty in the example JSON.
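
For example, assuming the role arrays hold strings (swap in a different
"items" type if they hold something else), each array field would need to
look something like:

    {
      "name": "manager",
      "type": {
        "type": "array",
        "items": "string"
      }
    }

The same "items" addition applies to "related_teams" and "properties".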

On Tue, Sep 17, 2019 at 7:40 AM Wesley C. Dias de Oliveira
 wrote:
>
> Hello, community.
>
> I'm trying to convert JSON to Avro using the following Avro schema:
>
> {
>   "name": "team",
>   "type": "record",
>   "namespace": "datarocks",
>   "fields": [
> {
>   "name": "department_slug",
>   "type": "string"
> },
> {
>   "name": "slug",
>   "type": "string"
> },
> {
>   "name": "name",
>   "type": "string"
> },
> {
>   "name": "team_id",
>   "type": "int"
> },
> {
>   "name": "roles",
>   "type": {
> "name": "roles",
> "type": "record",
> "fields": [
>   {
> "name": "manager",
> "type": {
>   "type": "array"
> }
>   },
>   {
> "name": "supervisor",
> "type": {
>   "type": "array"
> }
>   },
>   {
> "name": "revenue_manager",
> "type": {
>   "type": "array"
> }
>   },
>   {
> "name": "analyst",
> "type": {
>   "type": "array"
> }
>   }
> ]
>   }
> },
> {
>   "name": "related_teams",
>   "type": {
> "type": "array"
>   }
> },
> {
>   "name": "properties",
>   "type": {
> "type": "array"
>   }
> }
>   ]
> }
>
> Here's the JSON I use to parse:
> {
>   "department_slug" : "org-dev",
>   "slug" : "content_team",
>   "name" : "Content Team",
>   "team_id" : 1,
>   "roles" : {
> "manager" : [ ],
> "supervisor" : [ ],
> "revenue_manager" : [ ],
> "analyst" : [ ]
>   },
>   "related_teams" : [ ],
>   "properties" : [ ]
> }
>
> Nifi version: 1.9.2
>
> The processor does not seem to recognize the array type in the Avro schema.
>
> Does someone have an idea?
>
> Thanks for your help.
> --
> Grato,
> Wesley C. Dias de Oliveira.
>
> Linux User nº 576838.


Use Array Type On Avro Schema

2019-09-17 Thread Wesley C. Dias de Oliveira
Hello, community.

I'm trying to convert JSON to Avro using the following Avro schema:

{
  "name": "team",
  "type": "record",
  "namespace": "datarocks",
  "fields": [
{
  "name": "department_slug",
  "type": "string"
},
{
  "name": "slug",
  "type": "string"
},
{
  "name": "name",
  "type": "string"
},
{
  "name": "team_id",
  "type": "int"
},
{
  "name": "roles",
  "type": {
"name": "roles",
"type": "record",
"fields": [
  {
"name": "manager",
"type": {
  "type": "array"
}
  },
  {
"name": "supervisor",
"type": {
  "type": "array"
}
  },
  {
"name": "revenue_manager",
"type": {
  "type": "array"
}
  },
  {
"name": "analyst",
"type": {
  "type": "array"
}
  }
]
  }
},
{
  "name": "related_teams",
  "type": {
"type": "array"
  }
},
{
  "name": "properties",
  "type": {
"type": "array"
  }
}
  ]
}

Here's the JSON I use to parse:
{
  "department_slug" : "org-dev",
  "slug" : "content_team",
  "name" : "Content Team",
  "team_id" : 1,
  "roles" : {
"manager" : [ ],
"supervisor" : [ ],
"revenue_manager" : [ ],
"analyst" : [ ]
  },
  "related_teams" : [ ],
  "properties" : [ ]
}

Nifi version: 1.9.2

The processor does not seem to recognize the array type in the Avro schema.

Does someone have an idea?

Thanks for your help.
-- 
Grato,
Wesley C. Dias de Oliveira.

Linux User nº 576838.


json_data_to_parse.json
Description: application/json


avro_schema.json
Description: application/json