NiFi 1.9.0: Unable to change the state of a processor from DISABLED to ENABLED via the REST API
Hi Team,

I have an implementation where, when an error bulletin is received in the flow, the flow is stopped and then disabled via the REST API. This has been working fine for months. A few days ago I ran into a weird issue with HDFS processors: I received the error bulletin (a Kerberos login error, javax.security.auth.login.LoginException: connect timed out) and, as per the implementation, the flow was stopped and then disabled via the REST API. Later, when I enabled the flow using the REST API, I got this error in response:

Node abc.co.in:8443 is unable to fulfill this request due to: f99d210c-b20e-3870-ae1b-e1ec4702394f is not stopped.

When I look in the UI, it shows that the processor is disabled. Once I enable it from the UI it starts working normally. I'm stuck, because every time this issue occurs someone has to go into the NiFi UI and enable that processor manually, which is not feasible in a production environment. Any help would be appreciated.

Regards,
Mohit
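For scripting these transitions, the enable/disable call is a PUT against the processor with the desired state in the component body. A minimal sketch of building that body (Python; the endpoint path and revision handling follow my understanding of the NiFi 1.x REST API and should be verified against your version's REST docs, not taken from this thread):

```python
import json

def run_state_payload(processor_id, revision_version, state):
    """Build the body for PUT /nifi-api/processors/{id}.

    Valid state values are RUNNING, STOPPED, and DISABLED (enabling a
    disabled processor means setting it back to STOPPED). The revision
    version must match the value returned by a prior GET of the
    processor, or NiFi rejects the update.
    """
    assert state in ("RUNNING", "STOPPED", "DISABLED")
    return {
        "revision": {"version": revision_version},
        "component": {"id": processor_id, "state": state},
    }

# Transition order matters: RUNNING -> STOPPED -> DISABLED, and back.
# Before sending the next transition, poll GET /nifi-api/processors/{id}
# until the reported state really is STOPPED; sending the next request
# too early is one way to get "<id> is not stopped" from the cluster.
body = run_state_payload("f99d210c-b20e-3870-ae1b-e1ec4702394f", 7, "STOPPED")
print(json.dumps(body))
```

Polling between transitions, rather than firing the stop/disable/enable requests back to back, is the part most likely missing from a wrapper that hits this error intermittently.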
NiFi 1.9.0: ListFile throws NoSuchFileException
Hi Team,

I'm getting the following error from the ListFile processor:

ListFile[id=4ebbbe93-5e16-31d1-8793-63c704d4eae0] failed to process session due to java.nio.file.NoSuchFileException: /data/test/testfile.CTL; Processor Administratively Yielded for 1 sec: java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /data/test/testfile.CTL

The flow performs the following operations: 1. List the files in the folder (other flows also read from the same folder). 2. Fetch each file using FetchFile, moving it from the source folder to a processed folder. 3. Continue with the rest of the flow.

It's a 3-node NiFi 1.9.0 cluster, the flow reads from an NFS mount, and ListFile is scheduled to run on the primary node. Any help would be appreciated.

Thanks,
Mohit
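The NoSuchFileException is consistent with a race on the shared folder: between the listing and the fetch, another consumer moves the file away. A NiFi-free sketch of the same race (hypothetical paths; FetchFile's "Move File" completion strategy plays the role of consumer B):

```python
import os
import tempfile

# Consumer A lists a shared folder; consumer B moves the listed file to a
# "processed" folder before A gets to open it. A's open then fails, which
# is the analogue of ListFile/FetchFile hitting NoSuchFileException when
# several flows read from the same NFS mount.
src = tempfile.mkdtemp()
done = tempfile.mkdtemp()
path = os.path.join(src, "testfile.CTL")
open(path, "w").close()

listing = os.listdir(src)                             # consumer A lists
os.rename(path, os.path.join(done, "testfile.CTL"))   # consumer B moves it

try:
    with open(os.path.join(src, listing[0])) as fh:   # consumer A fetches too late
        fh.read()
    raised = False
except FileNotFoundError:
    raised = True
print(raised)
```

If multiple flows must consume the same directory, giving each its own listing pattern (or a single listing flow that fans out) avoids the race.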
Re: NiFi takes too long to start (~30 minutes)
Hi Andy,

Apologies for the late reply. Yes, there are around 25,000 processors (1,000 flows with around 25 processors each; I mistyped earlier). I started NiFi in debug mode to gather more information, but found nothing concrete. The following log is printed for every processor:

2020-08-24 22:56:04,416 INFO org.apache.nifi.controller.StandardProcessorNode: RouteOnAttribute[id=c4ea2bfc-bd88-3df8-aab2-4edc2b3a7678] disabled so ScheduledState transitioned from STOPPED to DISABLED.

And this for the HDFS processors:

2020-08-24 22:56:05,121 DEBUG org.apache.hadoop.util.NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: Native Library /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop/lib/native/libhadoop.so.1.0.0 already loaded in another classloader

These are the only logs that occur repeatedly. Finally, after half an hour, flow synchronization starts and NiFi is up within 2-3 minutes. There are 5 custom processors in use in the flow.

Thanks,
Mohit

On Wed, Aug 19, 2020 at 5:07 AM Andy LoPresto wrote:
> Mohit,
>
> I'm a little confused by some of the details you report. Initially you said there were ~100 flows with 25 processors each (2,500); now there are 25,000. If you examine the logs, you note that "reaching the election process" takes the majority of the time — what messages are being printed in the log before the election process starts? Have you taken thread dumps at regular intervals during this time to see where the time is being spent?
>
> This could be caused by the cost of fingerprinting such a large flow during cluster startup. Are there any custom processors in your NiFi instance?
> Andy LoPresto
> alopre...@apache.org
> alopresto.apa...@gmail.com
> He/Him
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>
> On Aug 16, 2020, at 10:28 PM, Mohit Jain wrote:
>
> Hi Pierre,
>
> No, the election process takes only a minute or two max; reaching the election process takes a lot of time. There are around 25k stopped/invalid components in the UI. I marked all of them as disabled hoping it would resolve the issue, but there was no improvement in the startup time.
>
> On a similar note, the NiFi REST API response is also quite slow when there are a lot of components. Would it improve by disabling the components which are not in use? Or is there something else that could have been causing the issue?
>
> Thanks,
> Mohit
>
> On Thu, Aug 13, 2020 at 9:49 PM Pierre Villard wrote:
>> Hi,
>>
>> I'm surprised this is something you observe in the election process part. I've constantly seen quick startup times even with thousands of components in the flow. I'd look into the logs and maybe turn on some debug logs to find out what's going on.
>>
>> Pierre
>>
>> On Thu, Aug 13, 2020 at 4:33 PM, Joe Witt wrote:
>>> Mohit,
>>>
>>> You almost certainly want to take that same flow and set up a cluster on a more recent version of NiFi to compare startup times. For flows with thousands of components there are important improvements which have occurred in the past year and a half.
>>>
>>> Startup time, user-perceived behavior in the UI on continuous operations, etc. have been improved. Further, you can now hot-load new versions of NARs, which should reduce the need to restart.
>>>
>>> We also will have 1.12 out hopefully within days, so that could be interesting for you as well.
>>>
>>> Thanks
>>>
>>> On Thu, Aug 13, 2020 at 7:18 AM Mohit Jain wrote:
>>>> Hi Team,
>>>>
>>>> I am using a single-node NiFi 1.9.0 cluster. It takes more than 30 minutes to start each time it is restarted. There are more than 100 flows on the NiFi UI with an average of 25 processors per flow. It takes around 25-30 minutes to reach the cluster election process, after which it gets started in a minute.
>>>>
>>>> Is it expected behaviour that startup time is directly proportional to the number of processors on the canvas? Or is there a way to reduce the NiFi startup time?
>>>>
>>>> Any leads would be appreciated.
>>>>
>>>> Thanks,
>>>> Mohit
Re: NiFi takes too long to start (~30 minutes)
Hi Pierre,

No, the election process takes only a minute or two max; reaching the election process takes a lot of time. There are around 25k stopped/invalid components in the UI. I marked all of them as disabled hoping it would resolve the issue, but there was no improvement in the startup time.

On a similar note, the NiFi REST API response is also quite slow when there are a lot of components. Would it improve by disabling the components which are not in use? Or is there something else that could have been causing the issue?

Thanks,
Mohit

On Thu, Aug 13, 2020 at 9:49 PM Pierre Villard wrote:
> Hi,
>
> I'm surprised this is something you observe in the election process part. I've constantly seen quick startup times even with thousands of components in the flow. I'd look into the logs and maybe turn on some debug logs to find out what's going on.
>
> Pierre
>
> On Thu, Aug 13, 2020 at 4:33 PM, Joe Witt wrote:
>> Mohit,
>>
>> You almost certainly want to take that same flow and set up a cluster on a more recent version of NiFi to compare startup times. For flows with thousands of components there are important improvements which have occurred in the past year and a half.
>>
>> Startup time, user-perceived behavior in the UI on continuous operations, etc. have been improved. Further, you can now hot-load new versions of NARs, which should reduce the need to restart.
>>
>> We also will have 1.12 out hopefully within days, so that could be interesting for you as well.
>>
>> Thanks
>>
>> On Thu, Aug 13, 2020 at 7:18 AM Mohit Jain wrote:
>>> Hi Team,
>>>
>>> I am using a single-node NiFi 1.9.0 cluster. It takes more than 30 minutes to start each time it is restarted. There are more than 100 flows on the NiFi UI with an average of 25 processors per flow. It takes around 25-30 minutes to reach the cluster election process, after which it gets started in a minute.
>>>
>>> Is it expected behaviour that startup time is directly proportional to the number of processors on the canvas? Or is there a way to reduce the NiFi startup time?
>>>
>>> Any leads would be appreciated.
>>>
>>> Thanks,
>>> Mohit
Re: NiFi takes too long to start (~30 minutes)
Hi Brandon,

There are no flow files in the queue.

Thanks,
Mohit

On Thu, Aug 13, 2020 at 9:56 PM Brandon DeVries wrote:
> Mohit,
>
> How many flowfiles are currently on the instance? Sometimes a very large number of flowfiles can result in slower start times.
>
> Brandon
>
> From: Pierre Villard
> Sent: Thursday, August 13, 2020 12:10:51 PM
> To: users@nifi.apache.org
> Subject: Re: NiFi takes too long to start (~30 minutes)
>
> Hi,
>
> I'm surprised this is something you observe in the election process part. I've constantly seen quick startup times even with thousands of components in the flow. I'd look into the logs and maybe turn on some debug logs to find out what's going on.
>
> Pierre
>
> On Thu, Aug 13, 2020 at 4:33 PM, Joe Witt wrote:
>
> Mohit,
>
> You almost certainly want to take that same flow and set up a cluster on a more recent version of NiFi to compare startup times. For flows with thousands of components there are important improvements which have occurred in the past year and a half.
>
> Startup time, user-perceived behavior in the UI on continuous operations, etc. have been improved. Further, you can now hot-load new versions of NARs, which should reduce the need to restart.
>
> We also will have 1.12 out hopefully within days, so that could be interesting for you as well.
>
> Thanks
>
> On Thu, Aug 13, 2020 at 7:18 AM Mohit Jain wrote:
>
> Hi Team,
>
> I am using a single-node NiFi 1.9.0 cluster. It takes more than 30 minutes to start each time it is restarted. There are more than 100 flows on the NiFi UI with an average of 25 processors per flow. It takes around 25-30 minutes to reach the cluster election process, after which it gets started in a minute.
>
> Is it expected behaviour that startup time is directly proportional to the number of processors on the canvas? Or is there a way to reduce the NiFi startup time?
>
> Any leads would be appreciated.
>
> Thanks,
> Mohit
NiFi takes too long to start (~30 minutes)
Hi Team,

I am using a single-node NiFi 1.9.0 cluster. It takes more than 30 minutes to start each time it is restarted. There are more than 100 flows on the NiFi UI with an average of 25 processors per flow. It takes around 25-30 minutes to reach the cluster election process, after which it gets started in a minute.

Is it expected behaviour that startup time is directly proportional to the number of processors on the canvas? Or is there a way to reduce the NiFi startup time?

Any leads would be appreciated.

Thanks,
Mohit
Re: Updating the Concurrent Tasks of a NiFi processor using the REST API
Thanks Otto, that helped.

On Fri, Jul 31, 2020 at 4:54 PM Otto Fowler wrote:
> If you do it from a browser, and are using the browser debugging tools, you will be able to catch the call. Everything you do in the web UI can be done via REST somehow.
>
> On July 31, 2020 at 06:47:06, Mohit Jain (mo...@open-insights.com) wrote:
>
> Hi Team,
>
> Is there a REST API to change the number of concurrent tasks? I wasn't able to find one in the docs.
>
> Any leads would be appreciated.
>
> Thanks,
> Mohit
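Following Otto's suggestion, watching the browser's network tab shows the UI issuing a PUT to /nifi-api/processors/{id} whose config carries the concurrent-task count. A hedged sketch of building that body (the field name concurrentlySchedulableTaskCount is my recollection of the processor config DTO; verify against your NiFi version's REST docs, and fetch the current revision with a GET first):

```python
def concurrent_tasks_payload(processor_id, revision_version, tasks):
    """Body for PUT /nifi-api/processors/{id} changing Concurrent Tasks.

    Only the fields being changed need to appear under "config"; the
    revision version must match the latest value from a prior GET of
    the processor, or the update is rejected.
    """
    return {
        "revision": {"version": revision_version},
        "component": {
            "id": processor_id,
            "config": {"concurrentlySchedulableTaskCount": tasks},
        },
    }

print(concurrent_tasks_payload("4ebbbe93-5e16-31d1-8793-63c704d4eae0", 5, 4))
```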
Updating the Concurrent Tasks of a NiFi processor using the REST API
Hi Team,

Is there a REST API to change the number of concurrent tasks? I wasn't able to find one in the docs.

Any leads would be appreciated.

Thanks,
Mohit
NiFi 1.9.0 MoveHDFS error: Failed to rename on HDFS
Hi Team,

I'm trying to move a file to another location using the MoveHDFS processor, with Conflict Resolution Strategy = replace. It throws an error when a file already exists in the target location. Error log:

MoveHDFS[id=] Failed to rename on HDFS due to could not move file '/path/to/source' to its final filename.

Any help would be appreciated.

Regards,
Mohit
Re: Urgent: HDFS processors throwing OOM - Compressed class space exception
Those are small files, not more than 100 MB each. NiFi is configured with 16g of heap.

Thanks

From: Jorge Machado
Sent: Wednesday, July 22, 2020 2:55:26 PM
To: users@nifi.apache.org
Subject: Re: Urgent: HDFS processors throwing OOM - Compressed class space exception

How big are the files that you are trying to store? How much memory did you configure NiFi with?

> On 22 Jul 2020, at 06:13, Mohit Jain wrote:
>
> Hi team,
>
> I've been facing an issue while using any HDFS processor; e.g., PutHDFS throws the error:
> Failed to write to HDFS due to compressed class space: java.lang.OutOfMemoryError
>
> Eventually the node gets disconnected.
>
> Any help would be appreciated.
>
> Regards,
> Mohit
Re: Urgent: HDFS processors throwing OOM - Compressed class space exception
Apologies, I intended to send it to the user list. I mistakenly sent it to the dev list, which is why I sent it here again.

Thanks,
Mohit

From: Joe Witt
Sent: Wednesday, July 22, 2020 10:05:16 AM
To: users@nifi.apache.org
Subject: Re: Urgent: HDFS processors throwing OOM - Compressed class space exception

you had already sent this urgent note to dev. now cross posted 40 minutes later. please dont

On Tue, Jul 21, 2020 at 9:14 PM Mohit Jain <mo...@open-insights.com> wrote:

Hi team,

I've been facing an issue while using any HDFS processor; e.g., PutHDFS throws the error:
Failed to write to HDFS due to compressed class space: java.lang.OutOfMemoryError

Eventually the node gets disconnected.

Any help would be appreciated.

Regards,
Mohit
Re: Urgent: Facing issue while trying to open a NiFi processor
Hi Team,

It seems like nifi-flow-audit.h2.db is corrupted; I'm not able to execute the same query on the database directly. Is there anything I can do to fix this? And in what scenarios might this happen again?

Regards,
Mohit

On Sat, Jul 11, 2020 at 2:48 PM Mohit Jain wrote:
> Hi Team,
>
> I'm facing a weird issue while trying to configure any NiFi processor:
>
> org.apache.nifi.admin.dao.DataAccessException: org.h2.jdbc.JdbcSQLException: Timeout trying to lock table "CONFIGURE_DETAILS"; SQL statement:
> SELECT DISTINCT CD.NAME FROM CONFIGURE_DETAILS CD INNER JOIN ACTION A ON CD.ACTION_ID = A.ID WHERE A.SOURCE_ID = ? [50200-176]
> at org.apache.nifi.admin.service.impl.StandardAuditService.getPreviousValues(StandardAuditService.java:102)
> at org.apache.nifi.web.StandardNiFiServiceFacade.getComponentHistory(StandardNiFiServiceFacade.java:4537)
> at org.apache.nifi.web.StandardNiFiServiceFacade$$FastClassBySpringCGLIB$$358780e0.invoke()
>
> The complete nifi-user.log is attached.
>
> Any input would be appreciated.
>
> Regards,
> Mohit
Urgent: Facing issue while trying to open a NiFi processor
Hi Team,

I'm facing a weird issue while trying to configure any NiFi processor:

org.apache.nifi.admin.dao.DataAccessException: org.h2.jdbc.JdbcSQLException: Timeout trying to lock table "CONFIGURE_DETAILS"; SQL statement:
SELECT DISTINCT CD.NAME FROM CONFIGURE_DETAILS CD INNER JOIN ACTION A ON CD.ACTION_ID = A.ID WHERE A.SOURCE_ID = ? [50200-176]
at org.apache.nifi.admin.service.impl.StandardAuditService.getPreviousValues(StandardAuditService.java:102)
at org.apache.nifi.web.StandardNiFiServiceFacade.getComponentHistory(StandardNiFiServiceFacade.java:4537)
at org.apache.nifi.web.StandardNiFiServiceFacade$$FastClassBySpringCGLIB$$358780e0.invoke()

The complete nifi-user.log is attached.

Any input would be appreciated.

Regards,
Mohit

2020-07-11 09:06:53,063 ERROR [NiFi Web Server-28918] o.a.n.w.a.c.AdministrationExceptionMapper org.apache.nifi.admin.service.AdministrationException: org.apache.nifi.admin.dao.DataAccessException: org.h2.jdbc.JdbcSQLException: Timeout trying to lock table "CONFIGURE_DETAILS"; SQL statement:
SELECT DISTINCT CD.NAME FROM CONFIGURE_DETAILS CD INNER JOIN ACTION A ON CD.ACTION_ID = A.ID WHERE A.SOURCE_ID = ? [50200-176]. Returning Internal Server Error response.
org.apache.nifi.admin.service.AdministrationException: org.apache.nifi.admin.dao.DataAccessException: org.h2.jdbc.JdbcSQLException: Timeout trying to lock table "CONFIGURE_DETAILS"; SQL statement:
SELECT DISTINCT CD.NAME FROM CONFIGURE_DETAILS CD INNER JOIN ACTION A ON CD.ACTION_ID = A.ID WHERE A.SOURCE_ID = ? [50200-176]
at org.apache.nifi.admin.service.impl.StandardAuditService.getPreviousValues(StandardAuditService.java:102)
at org.apache.nifi.web.StandardNiFiServiceFacade.getComponentHistory(StandardNiFiServiceFacade.java:4537)
at org.apache.nifi.web.StandardNiFiServiceFacade$$FastClassBySpringCGLIB$$358780e0.invoke()
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:736)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:84)
at org.apache.nifi.web.NiFiServiceFacadeLock.proceedWithReadLock(NiFiServiceFacadeLock.java:155)
at org.apache.nifi.web.NiFiServiceFacadeLock.getLock(NiFiServiceFacadeLock.java:120)
at sun.reflect.GeneratedMethodAccessor504.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:627)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:616)
at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:671)
at org.apache.nifi.web.StandardNiFiServiceFacade$$EnhancerBySpringCGLIB$$c0d134ca.getComponentHistory()
at org.apache.nifi.web.api.FlowResource.getComponentHistory(FlowResource.java:2548)
at sun.reflect.GeneratedMethodAccessor1088.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)
at org.glassfish.jersey.server.mod
HiveConnectionPool doesn't get enabled via the NiFi API in NiFi 1.9.2
Hi Team,

I have recently upgraded from NiFi 1.6.0 to NiFi 1.9.2. I have written a wrapper to start and stop the flow using the NiFi API. While starting the controller services, the HiveConnectionPool does not get enabled; all other controller services (e.g., the database controller service, AvroReader, AvroWriter, etc.) get enabled. I'm not able to figure out the reason. One thing I noticed is that PutHiveQL is the last processor in the flow; it seems like the last controller service doesn't get enabled.

Kindly help.

Regards,
Mohit
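When scripting controller-service startup, the enable call is a PUT with the desired state, and enabling is asynchronous. A minimal sketch of the request body (the endpoint path and state values reflect my understanding of the NiFi 1.x REST API; verify against your version's REST docs):

```python
def service_state_payload(service_id, revision_version, state):
    """Body for PUT /nifi-api/controller-services/{id}.

    Controller-service states are ENABLED / DISABLED. Enabling is
    asynchronous: after the PUT, poll GET /nifi-api/controller-services/{id}
    until the reported state is actually ENABLED (not ENABLING) before
    starting processors that reference the service. A wrapper that fires
    all enables and immediately starts processors can leave the last
    service in a dependency chain looking as if it never came up.
    """
    assert state in ("ENABLED", "DISABLED")
    return {
        "revision": {"version": revision_version},
        "component": {"id": service_id, "state": state},
    }

print(service_state_payload("hive-pool-id", 1, "ENABLED"))
```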
Unable to log in to NiFi UI with Ranger
Hi,

I've integrated NiFi 1.9.2 with Ranger. When I log in to the UI, the following error is shown: No applicable policies could be found. Contact the system administrator.

I'm not able to figure out what I'm missing. Kindly help.

Regards,
Mohit
Re: GenerateTableFetch throws error with MS SQL 2012+
Hi Matt,

Does it mean that all the other DBs (Oracle, MySQL, etc.) might fetch duplicate rows with GenerateTableFetch?

Thanks,
Mohit

On Thu, Aug 8, 2019 at 8:16 PM Matt Burgess wrote:
> Mohit,
>
> MSSQL seems to have the only parser (of the more popular DBs) that complains about an empty ORDER BY clause when doing paging (with OFFSET / FETCH NEXT), so the default behavior is to throw an exception. This happens when you don't supply a Max Value Column, meaning you want to fetch all rows but haven't specified how to order them, to ensure that each page gets unique rows and all rows are returned. Otherwise each fetch could grab duplicate rows and some rows may never be fetched, as the ordering is arbitrary. For some reason other DBs (PostgreSQL, Oracle) let you try it, with the documented caveat that the ordering is arbitrary.
>
> So an ordering must be applied, and to that end we added a Custom ORDER BY Clause property [1] such that the user can provide their own clause if no Max Value Column is specified. This property will be available in the upcoming NiFi 1.10.0 release. The only workarounds I know of are to use QueryDatabaseTable instead of GenerateTableFetch, or to supply a Max Value Column property value in GenerateTableFetch.
>
> Regards,
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-6348
>
> On Thu, Aug 8, 2019 at 6:18 AM Mohit Jain wrote:
> > Hi Team,
> >
> > I'm facing the following issue while using MS SQL 2012+ in NiFi 1.6.0:
> >
> > GenerateTableFetch[id=87d43fae-5c02-1f3f-a58f-9fafeac5a640] failed to process session due to Order by clause cannot be null or empty when using row paging; Processor Administratively Yielded for 1 sec: java.lang.IllegalArgumentException: Order by clause cannot be null or empty when using row paging
> >
> > It seems like a bug to me; is it resolved in the latest versions of NiFi?
> >
> > Thanks,
> > Mohit
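Matt's point can be made concrete with the paging form SQL Server 2012+ requires. A small illustrative generator (hypothetical table and column names, and not NiFi's actual SQL-generation code): without a deterministic ORDER BY, consecutive OFFSET/FETCH pages may overlap or skip rows.

```python
def page_query(table, order_by, page, page_size):
    """Build one page of a paged fetch, SQL Server 2012+ style.

    OFFSET/FETCH NEXT requires an ORDER BY; ordering by a unique key
    (what a Max Value Column effectively provides) is what makes the
    pages disjoint and complete.
    """
    return (
        f"SELECT * FROM {table} ORDER BY {order_by} "
        f"OFFSET {page * page_size} ROWS FETCH NEXT {page_size} ROWS ONLY"
    )

# Page 2 of a 10,000-row page size starts at row 20,000.
print(page_query("orders", "order_id", 2, 10000))
```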
GenerateTableFetch throws error with MS SQL 2012+
Hi Team,

I'm facing the following issue while using MS SQL 2012+ in NiFi 1.6.0:

GenerateTableFetch[id=87d43fae-5c02-1f3f-a58f-9fafeac5a640] failed to process session due to Order by clause cannot be null or empty when using row paging; Processor Administratively Yielded for 1 sec: java.lang.IllegalArgumentException: Order by clause cannot be null or empty when using row paging

It seems like a bug to me; is it resolved in the latest versions of NiFi?

Thanks,
Mohit
Re: Unable to log in to secured NiFi cluster via UI
Thanks Pierre.

On Wed, Aug 7, 2019 at 5:51 PM Pierre Villard wrote:
> Hi Mohit,
>
> The initial admin you configured is the user you should use for the first connection in order to grant authorizations to additional users/groups. The initial admin should have been automatically added to the authorizations.xml file created during the first start of NiFi.
>
> Hope this helps,
> Pierre
>
> On Wed, Aug 7, 2019 at 2:12 PM Mohit Jain wrote:
>> Hi team,
>>
>> I'm not able to open the NiFi canvas in the secured NiFi. It shows the following error message once I provide the credentials: No applicable policies could be found. Contact the system administrator.
>>
>> Kindly help.
>>
>> Thanks,
>> Mohit
Unable to log in to secured NiFi cluster via UI
Hi team,

I'm not able to open the NiFi canvas in the secured NiFi. It shows the following error message once I provide the credentials: No applicable policies could be found. Contact the system administrator.

Kindly help.

Thanks,
Mohit
RE: Unable to see NiFi data lineage in Atlas
Hi,

While looking at the logs, I found that ReportLineageToAtlas is not able to construct a KafkaProducer. It throws the following:

org.apache.kafka.common.KafkaException: Failed to construct kafka producer
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:335)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:188)
at org.apache.atlas.kafka.KafkaNotification.createProducer(KafkaNotification.java:286)
at org.apache.atlas.kafka.KafkaNotification.sendInternal(KafkaNotification.java:207)
at org.apache.atlas.notification.AbstractNotification.send(AbstractNotification.java:84)
at org.apache.atlas.hook.AtlasHook.notifyEntitiesInternal(AtlasHook.java:133)
at org.apache.atlas.hook.AtlasHook.notifyEntities(AtlasHook.java:118)
at org.apache.atlas.hook.AtlasHook.notifyEntities(AtlasHook.java:171)
at org.apache.nifi.atlas.NiFiAtlasHook.commitMessages(NiFiAtlasHook.java:150)
at org.apache.nifi.atlas.reporting.ReportLineageToAtlas.lambda$consumeNiFiProvenanceEvents$6(ReportLineageToAtlas.java:721)
at org.apache.nifi.reporting.util.provenance.ProvenanceEventConsumer.consumeEvents(ProvenanceEventConsumer.java:204)
at org.apache.nifi.atlas.reporting.ReportLineageToAtlas.consumeNiFiProvenanceEvents(ReportLineageToAtlas.java:712)
at org.apache.nifi.atlas.reporting.ReportLineageToAtlas.onTrigger(ReportLineageToAtlas.java:664)
at org.apache.nifi.controller.tasks.ReportingTaskWrapper.run(ReportingTaskWrapper.java:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: No enum constant org.apache.kafka.common.protocol.SecurityProtocol.PLAINTEXTSASL
at java.lang.Enum.valueOf(Enum.java:238)
at org.apache.kafka.common.protocol.SecurityProtocol.valueOf(SecurityProtocol.java:28)
at org.apache.kafka.common.protocol.SecurityProtocol.forName(SecurityProtocol.java:89)
at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:79)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:277)
... 20 common frames omitted

Thanks,
Mohit

From: Mohit
Sent: 25 July 2018 17:46
To: users@nifi.apache.org
Subject: Unable to see NiFi data lineage in Atlas

Hi all,

I have configured the ReportLineageToAtlas reporting task to send NiFi flow information to Atlas. NiFi is integrated with Ranger. I am able to see all the information in Atlas except the lineage. When I search for hdfs_path or hive_table, I can only see the Hive-side information. I can't find anything wrong in the configuration. Is there something in the Ranger configuration that I'm missing?

Regards,
Mohit
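The root cause in the trace above is the protocol name itself: the Apache Kafka client's SecurityProtocol enum has no PLAINTEXTSASL constant; that spelling is the legacy HDP-style name, whose Apache-side equivalent is SASL_PLAINTEXT. A tiny sketch of the mapping; the actual fix is usually just changing the security-protocol value in the Kafka/Atlas configuration the reporting task reads, not code:

```python
# Map the legacy HDP-style protocol name to the value the Apache Kafka
# client's SecurityProtocol enum actually accepts; anything else is
# passed through unchanged.
LEGACY_TO_APACHE = {"PLAINTEXTSASL": "SASL_PLAINTEXT"}

def normalize_security_protocol(value):
    return LEGACY_TO_APACHE.get(value, value)

print(normalize_security_protocol("PLAINTEXTSASL"))
```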
Unable to see NiFi data lineage in Atlas
Hi all,

I have configured the ReportLineageToAtlas reporting task to send NiFi flow information to Atlas. NiFi is integrated with Ranger. I am able to see all the information in Atlas except the lineage. When I search for hdfs_path or hive_table, I can only see the Hive-side information. I can't find anything wrong in the configuration. Is there something in the Ranger configuration that I'm missing?

Regards,
Mohit
GenerateTableFetch -> RPG -> ExecuteSQL fetching duplicate records from Netezza, but the count is the same
Hi all,

I am fetching data from Netezza using GenerateTableFetch -> RPG -> ExecuteSQL -> PutHDFS. It works fine most of the time, but for some tables with more than a million rows it fetches duplicate rows. The Partition Size varies from 3 million to 30 million depending on table size; for a table with ~300 million rows the size is 30 million, and likewise.

For example:
Table: abc
Netezza count: 3265421
Hive count: 3265421
Duplicate rows in Hive: 97070

Is this expected behaviour when fetching from Netezza?

Regards,
Mohit
RE: SelectHiveQL gets stuck when querying table containing 12 billion rows
Thanks Shawn, I followed a similar approach.

Regards,
Mohit

From: Shawn Weeks
Sent: 27 June 2018 19:22
To: users@nifi.apache.org
Subject: Re: SelectHiveQL gets stuck when querying table containing 12 billion rows

Well, to get the partitions you can execute a 'show partitions table_name', then use SplitRecord with an AvroReader and a JSON writer to generate a flow file per partition. That flow file can then be read with EvaluateJsonPath to pull the partition_name into an attribute on the flow file. Then finally a ReplaceText to actually write out the select statement, substituting the partition variable.

Thanks
Shawn

From: Mohit <mohit.j...@open-insights.co.in>
Sent: Wednesday, June 27, 2018 8:40:20 AM
To: users@nifi.apache.org
Subject: RE: SelectHiveQL gets stuck when querying table containing 12 billion rows

Hi,

Yes, I tried to fetch around 40 million rows, which took time but was executed. I'll try the Avro approach. How do I break the select into multiple parts? Can you briefly explain the partition flow to start with?

Thanks,
Mohit

From: Shawn Weeks <swe...@weeksconsulting.us>
Sent: 27 June 2018 18:51
To: users@nifi.apache.org
Subject: Re: SelectHiveQL gets stuck when querying table containing 12 billion rows

It's probably not stuck doing nothing; using a JDBC connection to fetch 12 billion rows is going to be painful no matter what you do. At those kinds of sizes you're probably better off having Hive create a temporary table in Avro format and then consuming the Avro files from HDFS into NiFi. The largest number of rows I've pulled into NiFi via JDBC in a single query is around 10-20 million, and that took a long time. You can also try breaking the select into multiple parts and running them simultaneously. I've done something similar where I first ran a query to get all of the partitions and then executed a select for each partition in parallel.

Thanks
Shawn

From: Mohit <mohit.j...@open-insights.co.in>
Sent: Wednesday, June 27, 2018 8:14:25 AM
To: users@nifi.apache.org
Subject: SelectHiveQL gets stuck when querying table containing 12 billion rows

Hi all,

I'm trying to fetch data from Hive using SelectHiveQL. It works fine for small to medium sized tables, but when I try to fetch data from a large table with around 12 billion rows it gets stuck for hours and does nothing. I have set the Max Rows Per Flow File property to 10 million. We have a 4-node NiFi cluster with 150 GB RAM each. Is there any configuration that needs tuning to make this work?

Regards,
Mohit
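Shawn's per-partition approach boils down to turning `show partitions` output into one SELECT per partition. A NiFi-free sketch of that text manipulation (hypothetical table and partition names; in the flow, EvaluateJsonPath + ReplaceText accomplish the same thing per flow file):

```python
def partition_selects(table, partitions):
    """Turn SHOW PARTITIONS output lines (e.g. "dt=2018-06-01" or
    "dt=2018-06-01/country=US" for multi-level partitions) into one
    SELECT per partition, which can then be run in parallel.
    """
    queries = []
    for spec in partitions:
        preds = " AND ".join(
            "{}='{}'".format(*kv.split("=", 1)) for kv in spec.split("/")
        )
        queries.append(f"SELECT * FROM {table} WHERE {preds}")
    return queries

for q in partition_selects("events", ["dt=2018-06-01/country=US"]):
    print(q)
```

Running one such query per partition keeps each JDBC result set to a manageable size instead of one 12-billion-row fetch.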
RE: SelectHiveQL gets stuck when querying table containing 12 billion rows
Hi,

Yes, I tried to fetch around 40 million rows, which took time but was executed. I'll try the Avro approach. How do I break the select into multiple parts? Can you briefly explain the partition flow to start with?

Thanks,
Mohit

From: Shawn Weeks <swe...@weeksconsulting.us>
Sent: 27 June 2018 18:51
To: users@nifi.apache.org
Subject: Re: SelectHiveQL gets stuck when querying table containing 12 billion rows

It's probably not stuck doing nothing; using a JDBC connection to fetch 12 billion rows is going to be painful no matter what you do. At those kinds of sizes you're probably better off having Hive create a temporary table in Avro format and then consuming the Avro files from HDFS into NiFi. The largest number of rows I've pulled into NiFi via JDBC in a single query is around 10-20 million, and that took a long time. You can also try breaking the select into multiple parts and running them simultaneously. I've done something similar where I first ran a query to get all of the partitions and then executed a select for each partition in parallel.

Thanks
Shawn

From: Mohit <mohit.j...@open-insights.co.in>
Sent: Wednesday, June 27, 2018 8:14:25 AM
To: users@nifi.apache.org
Subject: SelectHiveQL gets stuck when querying table containing 12 billion rows

Hi all,

I'm trying to fetch data from Hive using SelectHiveQL. It works fine for small to medium sized tables, but when I try to fetch data from a large table with around 12 billion rows it gets stuck for hours and does nothing. I have set the Max Rows Per Flow File property to 10 million. We have a 4-node NiFi cluster with 150 GB RAM each. Is there any configuration that needs tuning to make this work?

Regards,
Mohit
SelectHiveQl gets stuck when querying a table containing 12 billion rows
Hi all, I'm trying to fetch data from Hive using SelectHiveQL. It works fine for small to medium-sized tables, but when I try to fetch data from a large table with around 12 billion rows, it gets stuck for hours and does nothing. I have set the Max Rows Per Flow File property to 10 million. We have a 4-node NiFi cluster with 150 GB of RAM each. Is there any configuration I should tune to make this work? Regards, Mohit
RE: Unable to read the Parquet file written by NiFi through Spark when Logical Data Type is set to true.
Hi, Spark 2.3 has avro-1.7.7.jar. But Hive 1.2 uses avro-1.7.5.jar and I'm able to read the Parquet file from Hive, so I'm not sure if this is the reason. Thanks, Mohit From: Mike Thomsen Sent: 31 May 2018 14:15 To: users@nifi.apache.org Subject: Re: Unable to read the Parquet file written by NiFi through Spark when Logical Data Type is set to true. Maybe check to see which version of Avro is bundled with your deployment of Spark? On Thu, May 31, 2018 at 3:26 AM Mohit <mohit.j...@open-insights.co.in> wrote: Hi Mike, I have created the Hive external table on top of the Parquet files and am able to read it from Hive. While querying Hive from Spark, these are the errors – For the decimal type (in Hive, the data type is decimal(12,5)) – Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://ip-10-0-0-216.ap-south-1.compute.internal:8020/user/hermes/nifi_test1/test_pqt2_dcm/4963966040134. Column: [dc_type], Expected: DecimalType(12,5), Found: BINARY at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:192) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109) For the time column, which was converted into TIME_MILLIS (in Hive, the data type is int) – Caused by: org.apache.spark.sql.AnalysisException: Parquet type not yet supported: INT32 (TIME_MILLIS); at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotImplemented$1(ParquetSchemaConverter.scala:105) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:141) Thanks, Mohit From: Mike Thomsen <mikerthom...@gmail.com> Sent: 30 May 2018 17:28 To: users@nifi.apache.org Subject: Re: Unable to read the Parquet file written by NiFi through Spark when Logical Data Type is set to true. What's the error from Spark?
Logical data types are just a variant on existing data types in Avro 1.8. On Wed, May 30, 2018 at 7:54 AM Mohit <mohit.j...@open-insights.co.in> wrote: Hi all, I'm fetching data from an RDBMS and writing it to Parquet using the PutParquet processor. I'm not able to read the data from Spark when Logical Data Type is true, though I am able to read it from Hive. Do I have to set some specific properties in the PutParquet processor to make it readable from Spark as well? Regards, Mohit
RE: Unable to read the Parquet file written by NiFi through Spark when Logical Data Type is set to true.
Hi Mike, I have created the Hive external table on top of the Parquet files and am able to read it from Hive. While querying Hive from Spark, these are the errors – For the decimal type (in Hive, the data type is decimal(12,5)) – Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://ip-10-0-0-216.ap-south-1.compute.internal:8020/user/hermes/nifi_test1/test_pqt2_dcm/4963966040134. Column: [dc_type], Expected: DecimalType(12,5), Found: BINARY at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:192) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109) For the time column, which was converted into TIME_MILLIS (in Hive, the data type is int) – Caused by: org.apache.spark.sql.AnalysisException: Parquet type not yet supported: INT32 (TIME_MILLIS); at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotImplemented$1(ParquetSchemaConverter.scala:105) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:141) Thanks, Mohit From: Mike Thomsen Sent: 30 May 2018 17:28 To: users@nifi.apache.org Subject: Re: Unable to read the Parquet file written by NiFi through Spark when Logical Data Type is set to true. What's the error from Spark? Logical data types are just a variant on existing data types in Avro 1.8. On Wed, May 30, 2018 at 7:54 AM Mohit <mohit.j...@open-insights.co.in> wrote: Hi all, I'm fetching data from an RDBMS and writing it to Parquet using the PutParquet processor. I'm not able to read the data from Spark when Logical Data Type is true, though I am able to read it from Hive. Do I have to set some specific properties in the PutParquet processor to make it readable from Spark as well? Regards, Mohit
Unable to read the Parquet file written by NiFi through Spark when Logical Data Type is set to true.
Hi all, I'm fetching data from an RDBMS and writing it to Parquet using the PutParquet processor. I'm not able to read the data from Spark when Logical Data Type is true, though I am able to read it from Hive. Do I have to set some specific properties in the PutParquet processor to make it readable from Spark as well? Regards, Mohit
RE: PutDatabaseRecord Error while ingesting to Netezza
Hi, I have attached the Hive table's describe output. I'm using the SelectHiveQL processor and setting the CSV header to true. Then I'm using the CSVReader with Schema Access Strategy – Use String Fields From Header. PS – I tried to write to MySQL and it works fine with both CSVReader and AvroReader; for Netezza it is not able to write. Thanks, Mohit From: Pierre Villard <pierre.villard...@gmail.com> Sent: 24 May 2018 13:49 To: users@nifi.apache.org Subject: Re: PutDatabaseRecord Error while ingesting to Netezza Hi, Can we have more details: the describe of the Hive table and the schema you're using in the processor? Pierre 2018-05-24 8:26 GMT+02:00 Mohit <mohit.j...@open-insights.co.in>: Hi all, I'm facing an error while writing data to Netezza using PutDatabaseRecord. I get the following error – PutDatabaseRecord[id=90aad845-0163-1000--28414a59] Failed to process StandardFlowFileRecord[uuid=f06b1510-9f66-42c1-a0f8-29ffc7a4d30f,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1527141341281-340, container=default, section=340], offset=572216, length=52],offset=0,name=8068481294062298.0.csv,size=35] due to None of the fields in the record map to the columns defined by the test_data table: I'm reading the data from Hive and writing it to the flowfile in CSV format. Please let me know if I'm doing anything wrong.
Thanks, Mohit

Attached describe output:

cal_dt                    string
cal_tm                    string
event_typ_cd              string
tndr_typ_cd               string
lgl_entity_nbr            int
lgl_entity_sub_id         int
ntwk_id                   int
site_id                   int
lane_nbr                  int
chanl_cd                  string
id_nbr                    string
id_sys_nbr                int
id_sys_issuer_nbr         int
rd_amt                    string
tt_store_sys_ord_amt      string
per_id                    int
tran_nbr                  bigint
term_id                   int
lat_coord                 string
long_coord                string
device_present_ind        string
cnsmr_id_cnt              int
loy_cnsmr_id_cnt          int
ndc_ord_qty               int
ndc_ord_amt               string
que_trade_item_qty        int
trade_item_purch_amt      string
pos_trade_item_purch_amt  string
trade_item_purch_qty      int
gtz_trade_item_qty        int
gtz_trade_item_amt        string
vent_tms                  string
data_src                  string
data_qual_cd              string
mngng_cd                  string
batch_nbr                 int

# Detailed Table Information
Database:        test
Owner:           hdp-df
CreateTime:      Fri May 25 09:16:35 EDT 2018
LastAccessTime:  UNKNOWN
Protect Mode:    None
Retention:       0
Location:        hdfs://
Table Type:      EXTERNAL_TABLE
Table Parameters:
    EXTERNAL               TRUE
    numFiles               1
    totalSize              266050048
    transient_lastDdlTime  1527254195

# Storage Information
SerDe Library:  org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:    org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:     No
Num Buckets:    -1
Bucket Columns: []
Sort Columns:   []
Storage Desc Params:
    serialization.format  1
PutDatabaseRecord Error while ingesting to Netezza
Hi all, I'm facing an error while writing data to Netezza using PutDatabaseRecord. I get the following error - PutDatabaseRecord[id=90aad845-0163-1000--28414a59] Failed to process StandardFlowFileRecord[uuid=f06b1510-9f66-42c1-a0f8-29ffc7a4d30f,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1527141341281-340, container=default, section=340], offset=572216, length=52],offset=0,name=8068481294062298.0.csv,size=35] due to None of the fields in the record map to the columns defined by the test_data table: I'm reading the data from Hive and writing it to the flowfile in CSV format. Please let me know if I'm doing anything wrong. Thanks, Mohit
RE: Nifi Remote Process Group FlowFile Distribution among nodes
Thanks Koji, It was helpful. Mohit. -Original Message- From: Koji Kawamura <ijokaruma...@gmail.com> Sent: 09 May 2018 05:53 To: users@nifi.apache.org Subject: Re: Nifi Remote Process Group FlowFile Distribution among nodes Hi, You can see the description of the default behavior by hovering your mouse over the question mark icon (?) next to "Batch Settings". In short, for sending (to a remote input port), a 500 millisecond batch duration; for pulling (from a remote output port), a 5 second batch duration. Thanks, Koji On Tue, May 8, 2018 at 4:56 PM, Mohit <mohit.j...@open-insights.co.in> wrote: > Hi, > > What are the configurations in the default batch settings? > > Thanks, > Mohit > > -Original Message- > From: Koji Kawamura <ijokaruma...@gmail.com> > Sent: 08 May 2018 13:19 > To: users@nifi.apache.org > Subject: Re: Nifi Remote Process Group FlowFile Distribution among > nodes > > Hi Mohit, > > NiFi RPG batches multiple FlowFiles into the same Site-to-Site transaction, > and the default batch settings are configured for higher throughput. > If you prefer more granular distribution, you can lower the batch > configurations from the "Manage Remote Ports" context menu of a > RemoteProcessGroup. > The batch size configuration in the UI has been available since NiFi 1.2.0, and the JIRA > can be a reference. > https://issues.apache.org/jira/browse/NIFI-1202 > > Thanks, > Koji > > On Tue, May 8, 2018 at 2:24 PM, Mohit <mohit.j...@open-insights.co.in> wrote: >> Hi, >> >> >> >> I need some clarity on how flowfiles are distributed among the different >> nodes in a NiFi cluster. >> >> >> >> I have a flow where I'm using GenerateTableFetch to fetch the data >> from the database. The source table has around 40 million records. I tried >> different partition sizes, which created different numbers of >> flowfiles. 
>> >> When there are fewer flowfiles (~40), the RPG sends them to only >> one node (in a 4-node cluster), but when there are many >> flowfiles (~400), it distributes them among all the nodes. >> >> Are there any rules or best practices to fine-tune the flow so that >> the flowfiles are evenly distributed across the nodes even when there >> are only a few flowfiles? >> >> >> >> Regards, >> >> Mohit >
RE: Nifi Remote Process Group FlowFile Distribution among nodes
Hi, What are the configurations in the default batch settings? Thanks, Mohit -Original Message- From: Koji Kawamura <ijokaruma...@gmail.com> Sent: 08 May 2018 13:19 To: users@nifi.apache.org Subject: Re: Nifi Remote Process Group FlowFile Distribution among nodes Hi Mohit, NiFi RPG batches multiple FlowFiles into the same Site-to-Site transaction, and the default batch settings are configured for higher throughput. If you prefer more granular distribution, you can lower the batch configurations from the "Manage Remote Ports" context menu of a RemoteProcessGroup. The batch size configuration in the UI has been available since NiFi 1.2.0, and the JIRA can be a reference. https://issues.apache.org/jira/browse/NIFI-1202 Thanks, Koji On Tue, May 8, 2018 at 2:24 PM, Mohit <mohit.j...@open-insights.co.in> wrote: > Hi, > > > > I need some clarity on how flowfiles are distributed among the different > nodes in a NiFi cluster. > > > > I have a flow where I'm using GenerateTableFetch to fetch the data > from the database. The source table has around 40 million records. I tried > different partition sizes, which created different numbers of > flowfiles. > > When there are fewer flowfiles (~40), the RPG sends them to only one > node (in a 4-node cluster), but when there are many > flowfiles (~400), it distributes them among all the nodes. > > Are there any rules or best practices to fine-tune the flow so that > the flowfiles are evenly distributed across the nodes even when there > are only a few flowfiles? > > > > Regards, > > Mohit
RE: Validate Record issue
Mark, Thanks for the input. I'll implement the same. Regards, Mohit From: Mark Payne <marka...@hotmail.com> Sent: 02 May 2018 23:13 To: users@nifi.apache.org Subject: Re: Validate Record issue Mohit, Correct - I was saying that it *should* allow that, but due to a bug (NIFI-5141) it currently does not. So in the meantime, you would have to update your schema to allow for either double or int (or long, if you prefer) types. Thanks -Mark On May 2, 2018, at 1:38 PM, Mohit <mohit.j...@open-insights.co.in> wrote: Hi Mark, I set Strict Type Checking to false, but it is still not allowed. Thanks, Mohit From: Mark Payne <marka...@hotmail.com> Sent: 02 May 2018 23:00 To: users@nifi.apache.org Subject: Re: Validate Record issue Mohit, If you look at the Provenance events that are emitted by the processor, they show the reason that the records are considered invalid. Specifically, for this use case, it shows: The following 2 fields had values whose type did not match the schema: [/hs_kbps, /site_id] It appears that in your incoming data, the values are integers instead of doubles. If you have the "Strict Type Checking" value set to "false" in the processor, then it should allow this. Unfortunately, though, it appears that there is a bug that causes integer values not to be considered valid when the schema says the field is a double. I created a JIRA [1] for this. In the meantime, if you update your schema to allow for those fields to be ["null", "long", "double"] then you should be good. Thanks -Mark [1] https://issues.apache.org/jira/browse/NIFI-5141 On May 2, 2018, at 11:57 AM, Mohit <mohit.j...@open-insights.co.in> wrote: Hi, I'm using the ValidateRecord processor to validate CSV and write it to Avro. For one file it is transferring all the records to the invalid relationship.
It is working fine with ConvertCsvToAvro processor. Avro Schema - {"type":"record","name":"cell_kpi_dump_geo","namespace":"cell_kpi_dump_geo","fields":[{"name":"month","type":["null","string"],"default":null}, {"name":"cell","type":["null","string"],"default":null},{"name":"availability","type":["int","null"],"default":0},{"name":"cssr_speech","type":["int","null"],"default":0}, {"name":"dcr_speech","type":["int","null"],"default":0},{"name":"hs_kbps","type":["double","null"],"default":0.0},{"name":"eul_kbps","type":["int","null"],"default":0}, {"name":"tech","type":["null","string"],"default":null},{"name":"site_id","type":["double","null"],"default":0.0},{"name":"longitude","type":["double","null"],"default":0.0},{"name":"latitude","type":["double","null"],"default":0.0}]} Sample record – May-16,KA4371D,95,100,0,151,,2G,4371,-1.606926,6.67223 Is there something I’m doing wrong? Regards, Mohit
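Applying Mark's workaround to the schema above, the two failing fields would be widened to accept integer values as well. One point to note (an Avro rule, not stated in the thread): a union's default value must match its first branch, so with "null" first the defaults become null. Only the two changed fields are shown here; the rest of the schema is unchanged:

```json
{"name":"hs_kbps","type":["null","long","double"],"default":null},
{"name":"site_id","type":["null","long","double"],"default":null}
```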
RE: Validate Record issue
Hi Mark, I set Strict Type Checking to false, but it is still not allowed. Thanks, Mohit From: Mark Payne <marka...@hotmail.com> Sent: 02 May 2018 23:00 To: users@nifi.apache.org Subject: Re: Validate Record issue Mohit, If you look at the Provenance events that are emitted by the processor, they show the reason that the records are considered invalid. Specifically, for this use case, it shows: The following 2 fields had values whose type did not match the schema: [/hs_kbps, /site_id] It appears that in your incoming data, the values are integers instead of doubles. If you have the "Strict Type Checking" value set to "false" in the processor, then it should allow this. Unfortunately, though, it appears that there is a bug that causes integer values not to be considered valid when the schema says the field is a double. I created a JIRA [1] for this. In the meantime, if you update your schema to allow for those fields to be ["null", "long", "double"] then you should be good. Thanks -Mark [1] https://issues.apache.org/jira/browse/NIFI-5141 On May 2, 2018, at 11:57 AM, Mohit <mohit.j...@open-insights.co.in> wrote: Hi, I'm using the ValidateRecord processor to validate CSV and write it to Avro. For one file it is transferring all the records to the invalid relationship. It works fine with the ConvertCSVToAvro processor.
Avro Schema - {"type":"record","name":"cell_kpi_dump_geo","namespace":"cell_kpi_dump_geo","fields":[{"name":"month","type":["null","string"],"default":null}, {"name":"cell","type":["null","string"],"default":null},{"name":"availability","type":["int","null"],"default":0},{"name":"cssr_speech","type":["int","null"],"default":0}, {"name":"dcr_speech","type":["int","null"],"default":0},{"name":"hs_kbps","type":["double","null"],"default":0.0},{"name":"eul_kbps","type":["int","null"],"default":0}, {"name":"tech","type":["null","string"],"default":null},{"name":"site_id","type":["double","null"],"default":0.0},{"name":"longitude","type":["double","null"],"default":0.0},{"name":"latitude","type":["double","null"],"default":0.0}]} Sample record – May-16,KA4371D,95,100,0,151,,2G,4371,-1.606926,6.67223 Is there something I’m doing wrong? Regards, Mohit
Validate Record issue
Hi, I'm using the ValidateRecord processor to validate CSV and write it to Avro. For one file it is transferring all the records to the invalid relationship. It works fine with the ConvertCSVToAvro processor. Avro Schema - {"type":"record","name":"cell_kpi_dump_geo","namespace":"cell_kpi_dump_geo","fields":[{"name":"month","type":["null","string"],"default":null},{"name":"cell","type":["null","string"],"default":null},{"name":"availability","type":["int","null"],"default":0},{"name":"cssr_speech","type":["int","null"],"default":0},{"name":"dcr_speech","type":["int","null"],"default":0},{"name":"hs_kbps","type":["double","null"],"default":0.0},{"name":"eul_kbps","type":["int","null"],"default":0},{"name":"tech","type":["null","string"],"default":null},{"name":"site_id","type":["double","null"],"default":0.0},{"name":"longitude","type":["double","null"],"default":0.0},{"name":"latitude","type":["double","null"],"default":0.0}]} Sample record - May-16,KA4371D,95,100,0,151,,2G,4371,-1.606926,6.67223 Is there something I'm doing wrong? Regards, Mohit
RE: Nifi 1.6.0 ValidateRecord Processor- AvroRecordSetWriter issue
Matt, I have used all default compression settings as of now, i.e., it is set to None for AvroRecordSetWriter, PutHDFS, ConvertAvroToORC. Regards, Mohit -Original Message- From: Matt Burgess <mattyb...@apache.org> Sent: 24 April 2018 20:58 To: users@nifi.apache.org Subject: Re: Nifi 1.6.0 ValidateRecord Processor- AvroRecordSetWriter issue Mohit, What are you using for compression settings for ConvertCSVToAvro, ConvertAvroToORC, AvroRecordSetWriter, PutHDFS, etc.? The default for ConvertCSVToAvro is Snappy where the rest default to None (although IIRC ConvertAvroToORC will retain an incoming Snappy codec in the outgoing ORC). If you used all defaults, I'm surprised the exact opposite isn't the case, as we've seen issues with processors like CompressContent using Snappy not working with Hive tables. Also, are you using the same schema in ConvertCSVToAvro as you are in AvroRecordSetWriter? If you view the Avro flow files (in the UI, listing the queue then viewing a flow file in the queue) before ConvertAvroToORC, do they appear to look the same in terms of the data and the data types? Regards, Matt On Tue, Apr 24, 2018 at 9:29 AM, Mohit <mohit.j...@open-insights.co.in> wrote: > Hi, > > I'm using the hive.ddl attribute formed by ConvertAvroToORC for the DDL > statement, and passing it to a custom processor that modifies the content of the > flowfile to the hive.ddl statement + the location of the external table. > (ReplaceText routes the file to failure if the content exceeds the 1 MB maximum > buffer size, so it is not a good option for large data.) I don't think this is > causing any issue. > > I tried to create the table on top of the Avro file as well; it also displays > null. > > In the AvroRecordSetWriter doc, it says "Writes the contents of a RecordSet > in Binary Avro format." Is this format different from what ConvertCSVToAvro > writes? Because the same flow works with ValidateRecord (writing to CSV) > + ConvertCSVToAvro.
> > Thanks, > Mohit > > -Original Message- > From: Matt Burgess <mattyb...@apache.org> > Sent: 24 April 2018 18:43 > To: users@nifi.apache.org > Subject: Re: Nifi 1.6.0 ValidateRecord Processor- AvroRecordSetWriter > issue > > Mohit, > > Can you share the config for your ConvertAvroToORC processor? Also, by > "CreateHiveTable", do you mean ReplaceText (to set the content to the > hive.ddl attribute formed by ConvertAvroToORC) -> PutHiveQL (to execute the > DDL)? If not, are you using a custom processor or ExecuteStreamCommand or > something else? > > If you are not using the generated DDL to create the table, can you share > your CREATE TABLE statement for the target table? I'm guessing there's a > mismatch somewhere between the data and the table definition. > > Regards, > Matt > > > On Tue, Apr 24, 2018 at 9:09 AM, Mohit <mohit.j...@open-insights.co.in> wrote: >> Hi all, >> >> >> >> I’m using ValidateRecord processor to validate the csv and convert it >> into Avro. Later, I convert this avro to orc using ConvertAvroToORC >> processor and write it to hdfs and create a hive table on the top of it. >> >> When I query the table, it displays null, though the record count is >> matching. >> >> >> >> Flow - ValidateRecord -> ConvertAvroToORC -> PutHDFS -> >> CreateHiveTable >> >> >> >> >> >> To debug, I also tried to write the avro data to hdfs and created the >> hive table on the top of it. It is also displaying null results. >> >> >> >> Flow - ValidateRecord -> ConvertCSVToAvro -> PutHDFS I manually created >> hive table with avro format. >> >> >> >> >> >> When I use ValidateRecord + ConvertCSVToAvro, it is working fine. >> >> Flow - ValidateRecord -> ConvertCSVToAvro -> ConvertAvroToORC -> >> PutHDFS >> -> CreateHiveTable >> >> >> >> >> >> Is there anything I’m doing wrong? >> >> >> >> Thanks, >> >> Mohit >> >> >> >> >
RE: Nifi 1.6.0 ValidateRecord Processor- AvroRecordSetWriter issue
Hi, I'm using the hive.ddl attribute formed by ConvertAvroToORC for the DDL statement, and passing it to a custom processor that modifies the content of the flowfile to the hive.ddl statement + the location of the external table. (ReplaceText routes the file to failure if the content exceeds the 1 MB maximum buffer size, so it is not a good option for large data.) I don't think this is causing any issue. I tried to create the table on top of the Avro file as well; it also displays null. In the AvroRecordSetWriter doc, it says "Writes the contents of a RecordSet in Binary Avro format." Is this format different from what ConvertCSVToAvro writes? Because the same flow works with ValidateRecord (writing to CSV) + ConvertCSVToAvro. Thanks, Mohit -Original Message- From: Matt Burgess <mattyb...@apache.org> Sent: 24 April 2018 18:43 To: users@nifi.apache.org Subject: Re: Nifi 1.6.0 ValidateRecord Processor- AvroRecordSetWriter issue Mohit, Can you share the config for your ConvertAvroToORC processor? Also, by "CreateHiveTable", do you mean ReplaceText (to set the content to the hive.ddl attribute formed by ConvertAvroToORC) -> PutHiveQL (to execute the DDL)? If not, are you using a custom processor or ExecuteStreamCommand or something else? If you are not using the generated DDL to create the table, can you share your CREATE TABLE statement for the target table? I'm guessing there's a mismatch somewhere between the data and the table definition. Regards, Matt On Tue, Apr 24, 2018 at 9:09 AM, Mohit <mohit.j...@open-insights.co.in> wrote: >> Hi all, >> >> >> >> I'm using the ValidateRecord processor to validate the CSV and convert it >> into Avro. Later, I convert this Avro to ORC using the ConvertAvroToORC >> processor, write it to HDFS, and create a Hive table on top of it. >> >> When I query the table, it displays null, though the record count >> matches. >> >> >> >> Flow - ValidateRecord -> ConvertAvroToORC -> PutHDFS -> >> CreateHiveTable >> >> >> >> >> >> To debug, I also tried writing the Avro data to HDFS and creating the >> Hive table on top of it. It also displays null results. >> >> >> >> Flow - ValidateRecord -> ConvertCSVToAvro -> PutHDFS (I manually created >> the Hive table with Avro format.) >> >> >> >> >> >> When I use ValidateRecord + ConvertCSVToAvro, it works fine. >> >> Flow - ValidateRecord -> ConvertCSVToAvro -> ConvertAvroToORC -> >> PutHDFS >> -> CreateHiveTable >> >> >> >> >> >> Is there anything I'm doing wrong? >> >> >> >> Thanks, >> >> Mohit
> > > > Flow - ValidateRecord -> ConvertAvroToORC -> PutHDFS -> > CreateHiveTable > > > > > > To debug, I also tried to write the avro data to hdfs and created the > hive table on the top of it. It is also displaying null results. > > > > Flow - ValidateRecord -> ConvertCSVToAvro -> PutHDFS I manually created > hive table with avro format. > > > > > > When I use ValidateRecord + ConvertCSVToAvro, it is working fine. > > Flow - ValidateRecord -> ConvertCSVToAvro -> ConvertAvroToORC -> > PutHDFS > -> CreateHiveTable > > > > > > Is there anything I’m doing wrong? > > > > Thanks, > > Mohit > > > >
Nifi 1.6.0 ValidateRecord Processor- AvroRecordSetWriter issue
Hi all, I'm using the ValidateRecord processor to validate the CSV and convert it into Avro. Later, I convert this Avro to ORC using the ConvertAvroToORC processor, write it to HDFS, and create a Hive table on top of it. When I query the table, it displays null, though the record count matches. Flow - ValidateRecord -> ConvertAvroToORC -> PutHDFS -> CreateHiveTable To debug, I also tried writing the Avro data to HDFS and creating the Hive table on top of it. It also displays null results. Flow - ValidateRecord -> ConvertCSVToAvro -> PutHDFS (I manually created the Hive table with Avro format.) When I use ValidateRecord + ConvertCSVToAvro, it works fine. Flow - ValidateRecord -> ConvertCSVToAvro -> ConvertAvroToORC -> PutHDFS -> CreateHiveTable Is there anything I'm doing wrong? Thanks, Mohit
RE: Generate a surrogate key/sequence for each avro record
I'm able to achieve it using the QueryRecord processor. Thanks, Mohit From: Mohit <mohit.j...@open-insights.co.in> Sent: 23 April 2018 15:59 To: users@nifi.apache.org Subject: Generate a surrogate key/sequence for each avro record Hi, I'm trying to add some extra info, like a surrogate key/sequence number, to the data fetched from the RDBMS. Is there a way to generate a surrogate key for each Avro record in NiFi? Regards, Mohit
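For readers landing here later, this is a NiFi-agnostic sketch of what the transformation amounts to: assign a monotonically increasing key to each record. The function and field names (`add_surrogate_key`, `seq_id`) are made up for illustration; the thread's actual solution used QueryRecord:

```python
# Sketch of generating a surrogate key per record (not the QueryRecord
# implementation from the thread; names here are hypothetical).

def add_surrogate_key(records, key_field="seq_id", start=1):
    """Return a copy of each record with an increasing sequence number added."""
    return [
        {key_field: i, **rec}
        for i, rec in enumerate(records, start=start)
    ]

rows = add_surrogate_key([{"name": "a"}, {"name": "b"}])
for row in rows:
    print(row)
```

In a clustered flow, note that a per-node counter like this would not be globally unique; the `start` offset is one simple lever for partitioning key ranges.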
Generate a surrogate key/sequence for each avro record
Hi, I'm trying to add some extra info, like a surrogate key/sequence number, to the data fetched from the RDBMS. Is there a way to generate a surrogate key for each Avro record in NiFi? Regards, Mohit
Nifi Custom processor code coverage
Hi, I'm testing a custom processor. I've used the customValidate(ValidationContext context) method to validate the properties. When I run coverage, that piece of code is not covered. I've used assertNotValid() but it doesn't increase my code coverage. Is there any way to test that part? My code coverage is affected by it. It would be helpful if there were documentation available for nifi-mock. Regards, Mohit
RE: NiFi 1.6
Joe, What is the expected date? From: Joe Witt Sent: 06 April 2018 02:20 To: users@nifi.apache.org Subject: Re: NiFi 1.6 Dan, It is presently working through the release candidate vote process. As it stands now it could be out tomorrow. Please do help by reviewing the RC if you have time. If you have questions on how to do it, just let us know and we can help. thanks joe On Thu, Apr 5, 2018, 1:46 PM dan young wrote: any updates on when 1.6 is going to drop? dano
RE: ConvertCSVToAvro taking a lot of time when passing single record as an input.
Hi Mike, I intentionally did this, just to check how the processor handles an invalid record. Thanks, Mohit From: Mike Thomsen <mikerthom...@gmail.com> Sent: 03 April 2018 18:17 To: users@nifi.apache.org Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input. Mohit, Looking at your schema: { "type": "record", "name": "test", "namespace": "test", "fields": [{ "name": "name", "type": ["null", "int"], "default": null }, { "name": "age", "type": ["null", "string"], "default": null }] } It looks like you have your fields' types backwards. (Name should be string, age should be int) On Tue, Apr 3, 2018 at 8:41 AM, Mohit <mohit.j...@open-insights.co.in> wrote: Pierre, Thanks for the information. This would be really helpful if NiFi 1.6.0 releases this week. There are a lot of pending tasks dependent on it. Mohit From: Pierre Villard <pierre.villard...@gmail.com> Sent: 03 April 2018 18:03 To: users@nifi.apache.org Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input. Hi Mohit, This has been fixed with https://issues.apache.org/jira/browse/NIFI-4955. Besides, what Mark suggested with NIFI-4883 (https://issues.apache.org/jira/browse/NIFI-4883) is now merged in master. Both will be in NiFi 1.6.0 to be released (hopefully this week). Pierre 2018-04-03 12:37 GMT+02:00 Mohit <mohit.j...@open-insights.co.in>: Hi, I am using the ValidateRecord processor, but it seems to change the order of the data. When I set the Include Header Line property to true, I found that the record isn't corrupted but the order varies.
For example - Actual order – bsc,cell_name,site_id,site_name,longitude,latitude,status,region,districts_216,area,town_city,cgi,cell_id,new_id,azimuth,cell_type,territory_name ATEBSC2,AC0139B,0139,0139LA_PALM,-0.14072,5.56353,Operational,Greater Accra Region,LA DADE-KOTOPON MUNICIPAL,LA_PALM,LABADI,62001-152-1392,1392,2332401392,60,2G,ACCRA METRO MAIN Order after converting using CSVRecordSetWriter – cgi,latitude,territory_name,azimuth,cell_type,cell_id,longitude,cell_name,area,new_id,districts_216,site_name,town_city,bsc,site_id,region,status 62001-152-1392,5.56353,ACCRA METRO MAIN,60,2G,1392,-0.14072,AC0139B,LA_PALM,2332401392,LA DADE-KOTOPON MUNICIPAL,0139LA_PALM,LABADI,ATEBSC2,0139,Greater Accra Region,Operational Is there any way to maintain the order of the record? Thanks, Mohit From: Mark Payne <marka...@hotmail.com> Sent: 02 April 2018 20:23 To: users@nifi.apache.org Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input. Mohit, I see. I think this is an issue because the Avro Writer expects that the data must be in the proper schema, or else it will throw an Exception when trying to write the data. To address this, we should update ValidateRecord to support a different Record Writer to use for valid data vs. invalid data. There already is a JIRA [1] for this improvement. In the meantime, it probably makes sense to use a CSV Reader and a CSV Writer for the Validate Record processor, then use ConvertRecord only for the valid records. Or, since you're running into this issue it may make sense for your use case to continue on with the ConvertCSVToAvro processor for now. But splitting the records up to run against that Processor may result in lower performance, as you've noted.
Thanks -Mark [1] https://issues.apache.org/jira/browse/NIFI-4883 On Apr 2, 2018, at 10:26 AM, Mohit <mohit.j...@open-insights.co.in> wrote: Mark, Error:- ValidateRecord[id=5a9c3616-ab7c-17c1--e6c2fc5d] ValidateRecord[id=5a9c3616-ab7c-17c1--e6c2fc5d] failed to process due to org.apache.nifi.serialization.record.util.IllegalTypeConversionException: Cannot convert value mohit of type class java.lang.String because no compatible types exist in the UNION for field name; rolling back session: Cannot convert value mohit of type class java.lang.String because no compatible types exist in the UNION for field name I have a file with only one record :- m
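For reference, with the field types swapped as Mike pointed out (name as string, age as int), the test schema from this thread would read:

```json
{
  "type": "record",
  "name": "test",
  "namespace": "test",
  "fields": [
    {"name": "name", "type": ["null", "string"], "default": null},
    {"name": "age", "type": ["null", "int"], "default": null}
  ]
}
```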
RE: ConvertCSVToAvro taking a lot of time when passing single record as an input.
Pierre,

Thanks for the information. This would be really helpful if NiFi 1.6.0 releases this week. There are a lot of pending tasks dependent on it.

Mohit

From: Pierre Villard <pierre.villard...@gmail.com>
Sent: 03 April 2018 18:03
To: users@nifi.apache.org
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.

Hi Mohit,

This has been fixed with https://issues.apache.org/jira/browse/NIFI-4955. Besides, what Mark suggested with NIFI-4883 <https://issues.apache.org/jira/browse/NIFI-4883> is now merged in master. Both will be in NiFi 1.6.0, to be released (hopefully this week).

Pierre
RE: ConvertCSVToAvro taking a lot of time when passing single record as an input.
Hi,

I am using the ValidateRecord processor, but it seems like it changes the order of the data. When I set the property Include Header Line to true, I found that the records aren't corrupted but the column order is changed.

For example -

Actual order:
bsc,cell_name,site_id,site_name,longitude,latitude,status,region,districts_216,area,town_city,cgi,cell_id,new_id,azimuth,cell_type,territory_name
ATEBSC2,AC0139B,0139,0139LA_PALM,-0.14072,5.56353,Operational,Greater Accra Region,LA DADE-KOTOPON MUNICIPAL,LA_PALM,LABADI,62001-152-1392,1392,2332401392,60,2G,ACCRA METRO MAIN

Order after converting using CSVRecordSetWriter:
cgi,latitude,territory_name,azimuth,cell_type,cell_id,longitude,cell_name,area,new_id,districts_216,site_name,town_city,bsc,site_id,region,status
62001-152-1392,5.56353,ACCRA METRO MAIN,60,2G,1392,-0.14072,AC0139B,LA_PALM,2332401392,LA DADE-KOTOPON MUNICIPAL,0139LA_PALM,LABADI,ATEBSC2,0139,Greater Accra Region,Operational

Is there any way to maintain the order of the records?

Thanks,
Mohit

From: Mark Payne <marka...@hotmail.com>
Sent: 02 April 2018 20:23
To: users@nifi.apache.org
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.

Mohit,

I see. I think this is an issue because the Avro Writer expects that the data must be in the proper schema, or else it will throw an Exception when trying to write the data. To address this, we should update ValidateRecord to support a different Record Writer to use for valid data vs. invalid data. There is already a JIRA [1] for this improvement. In the meantime, it probably makes sense to use a CSV Reader and a CSV Writer for the ValidateRecord processor, then use ConvertRecord only for the valid records. Or, since you're running into this issue, it may make sense for your use case to continue with the ConvertCSVToAvro processor for now. But splitting the records up to run against that processor may result in lower performance, as you've noted.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-4883
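[Editor's note: the column-order question above comes down to where the writer gets its field order. A minimal stand-alone Python sketch (illustrative only, not NiFi internals): a CSV writer emits columns in whatever field order it is handed, so capturing the input header order and passing it through preserves the layout.]

```python
import csv
import io

# Input CSV whose column order should survive a read/write round trip
# (a shortened version of the header from the example above).
source = io.StringIO("bsc,cell_name,site_id\nATEBSC2,AC0139B,0139\n")

reader = csv.DictReader(source)
field_order = reader.fieldnames  # capture the input header order

out = io.StringIO()
# The writer emits columns in the order of `fieldnames` -- pinning it to the
# input header keeps the layout, instead of letting some schema-derived
# ordering take over.
writer = csv.DictWriter(out, fieldnames=field_order)
writer.writeheader()
for row in reader:
    writer.writerow(row)

print(out.getvalue().splitlines()[0])  # -> bsc,cell_name,site_id
```

[In NiFi terms, the analogous fix is making sure the record writer's schema lists the fields in the desired order, e.g. by supplying an explicit schema rather than relying on an inferred or inherited one.]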
RE: ConvertCSVToAvro taking a lot of time when passing single record as an input.
Mark,

Error:

ValidateRecord[id=5a9c3616-ab7c-17c1--e6c2fc5d] failed to process due to org.apache.nifi.serialization.record.util.IllegalTypeConversionException: Cannot convert value mohit of type class java.lang.String because no compatible types exist in the UNION for field name; rolling back session: Cannot convert value mohit of type class java.lang.String because no compatible types exist in the UNION for field name

I have a file with only one record:

mohit,25

Just to check how it works, I've given an incorrect schema (int for the string field):

{"type":"record","name":"test","namespace":"test","fields":[{"name":"name","type":["null","int"],"default":null},{"name":"age","type":["null","string"],"default":null}]}

It doesn't pass the record to the invalid relationship. Instead it keeps the file in the queue prior to the ValidateRecord processor.

Mohit

From: Mark Payne <marka...@hotmail.com>
Sent: 02 April 2018 19:53
To: users@nifi.apache.org
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.

What is the error that you're seeing?
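[Editor's note: the IllegalTypeConversionException above is Avro's union rule at work: a value must match at least one branch of the union, and the string "mohit" matches neither "null" nor "int". A rough stand-alone sketch of that check (simplified Python, not the real Avro type resolver):]

```python
# Simplified checks for a few Avro primitive type names.
AVRO_CHECKS = {
    "null": lambda v: v is None,
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}

def matches_union(value, union):
    """True if the value matches at least one branch of an Avro union."""
    return any(AVRO_CHECKS[branch](value) for branch in union)

# The record and the deliberately wrong schema from the test above:
# name is declared ["null","int"], age is declared ["null","string"].
record = {"name": "mohit", "age": "25"}
schema = {"name": ["null", "int"], "age": ["null", "string"]}

for field, union in schema.items():
    if not matches_union(record[field], union):
        print(f"no compatible types exist in the UNION for field {field}")
```

[That also explains the stuck FlowFile: per Mark's diagnosis the failure happens in the Avro Writer while converting, not during validation, so the session is rolled back instead of the record being routed to 'invalid' -- exactly what NIFI-4883 addresses.]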
RE: ConvertCSVToAvro taking a lot of time when passing single record as an input.
Hi Mark,

I tried the ValidateRecord processor; it converts the flowfile if it is valid. But if the records are not valid, it is not passing them to the invalid relationship. Instead it keeps on throwing bulletins, keeping the flowfile in the queue. Any suggestion?

Mohit

From: Mark Payne <marka...@hotmail.com>
Sent: 02 April 2018 19:02
To: users@nifi.apache.org
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.

Mohit,

You can certainly dial back that number of Concurrent Tasks. Setting that to something like 10 is a pretty big number. Setting it to a thousand means that you'll likely starve out other processors that are waiting on a thread, and it will generally perform a lot worse because you have 1,000 different threads competing with each other to try to pull the next FlowFile.

You can use the ValidateRecord processor and configure a schema that indicates what you expect the data to look like. Then you can route any invalid records to one route and valid records to another route. This will ensure that all data that goes to the 'valid' relationship is routed one way and any other data is routed to the 'invalid' relationship.

Thanks
-Mark
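[Editor's note: the valid/invalid routing Mark describes boils down to partitioning records by a schema check. A toy Python sketch of the idea -- the field rules here are invented for illustration; NiFi's ValidateRecord does this against a real schema:]

```python
import csv
import io

def is_valid(row):
    """Illustrative rule set: name must be non-empty, age must parse as int."""
    try:
        int(row["age"])
        return bool(row["name"])
    except (KeyError, ValueError):
        return False

data = io.StringIO("name,age\nmohit,25\nghost,notanumber\n")

valid, invalid = [], []
for row in csv.DictReader(data):
    # Route each record down exactly one of the two relationships.
    (valid if is_valid(row) else invalid).append(row)

print(len(valid), len(invalid))  # -> 1 1
```

[Records reaching 'valid' can then be handed to a single ConvertRecord, keeping the data together in one FlowFile instead of splitting per record.]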
RE: ConvertCSVToAvro taking a lot of time when passing single record as an input.
Mark,

Agree with you. I'll try the ValidateRecord processor and see if it serves the purpose.

Thanks,
Mohit

From: Mark Payne <marka...@hotmail.com>
Sent: 02 April 2018 19:02
To: users@nifi.apache.org
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.
RE: ConvertCSVToAvro taking a lot of time when passing single record as an input.
Hi Mark,

The main intention of using such a flow is to track bad records. The records which don't get converted should be tracked somewhere; for that purpose I'm using the Split-Merge approach.

Meanwhile, I was able to improve the performance by increasing 'Concurrent Tasks' to 1000. Now ConvertCSVToAvro is able to convert 6-7k records per second, which, though not optimal, is quite a bit better than 45-50 records per second. Is there any other improvement I can do?

Mohit

From: Mark Payne <marka...@hotmail.com>
Sent: 02 April 2018 18:30
To: users@nifi.apache.org
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.

Mohit,

I agree that 45-50 records per second is quite slow. I'm not very familiar with the implementation of ConvertCSVToAvro, but it may well be that it must perform some sort of initialization for each FlowFile that it receives, which would explain why it's fast for a single incoming FlowFile and slow for a large number. Additionally, when you start splitting the data like that, you're generating a lot more FlowFiles, which means a lot more updates to both the FlowFile Repository and the Provenance Repository. As a result, you're basically taxing the NiFi framework far more than if you keep the data as a single FlowFile. On my laptop, though, I would expect more than 45-50 FlowFiles per second through most processors, but I don't know what kind of hardware you are running on.

In general, though, it is best to keep data together instead of splitting it apart. Since ConvertCSVToAvro can handle many CSV records, is there a reason to split the data to begin with? Also, I would recommend you look at using the Record-based processors [1][2] such as ConvertRecord instead of the ConvertABCtoXYZ processors, as those are older processors and often don't work as well, and the Record-oriented processors often allow you to keep data together as a single FlowFile throughout your entire flow, which makes the performance far better and makes the flow much easier to design.

Thanks
-Mark

[1] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
[2] https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
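[Editor's note: Mark's per-FlowFile overhead argument can be put into a back-of-the-envelope model: each FlowFile pays a fixed cost (repository and provenance updates, per-file setup) on top of the per-record conversion work, so one-record FlowFiles pay that fixed cost 64,000 times. The cost numbers below are invented purely to show the shape of the effect:]

```python
RECORDS = 64_000        # records in the 10 MB file from this thread
CONVERT_COST = 1        # work units to convert one record (illustrative)
FLOWFILE_OVERHEAD = 50  # fixed work units per FlowFile: repo + provenance
                        # updates, per-file initialization (illustrative)

# One FlowFile carrying every record vs. one FlowFile per record.
batch_cost = FLOWFILE_OVERHEAD + RECORDS * CONVERT_COST
split_cost = RECORDS * (FLOWFILE_OVERHEAD + CONVERT_COST)

print(f"batch: {batch_cost:,}  split: {split_cost:,}  "
      f"ratio: {split_cost / batch_cost:.1f}x")
```

[Whatever the real constants are, the split flow scales the fixed overhead with the record count, which is consistent with the observed drop from "a few seconds" to 45-50 records per second.]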
ConvertCSVToAvro taking a lot of time when passing single record as an input.
Hi,

I'm trying to capture bad records from the ConvertCSVToAvro processor. For that, I'm using two SplitText processors in a row to create chunks and then one record per flow file.

My flow is: ListFile -> FetchFile -> SplitText(1 records) -> SplitText(1 record) -> ConvertCSVToAvro -> * (further processing)

I have a 10 MB file with 15 columns per row and 64000 records. The normal flow (without SplitText) completes in a few seconds. But when I use the above flow, the ConvertCSVToAvro processor is drastically slow (45-50 rec/sec). I'm not able to conclude what I'm doing wrong in the flow. I'm using NiFi 1.5.0.

Any quick input would be appreciated.

Thanks,
Mohit