RE: Nifi vs Sqoop

2016-11-10 Thread Provenzano Nicolas
Thanks Bryan.

From: Bryan Bende [mailto:bbe...@gmail.com]
Sent: Thursday, November 10, 2016 15:26
To: users@nifi.apache.org
Subject: Re: Nifi vs Sqoop

Hello,

I can't speak to a direct comparison between NiFi and Sqoop, but I can say that 
Sqoop is a specific tool that was built just for database extraction, so it can 
probably do some things NiFi can't, since NiFi is a general-purpose data flow 
tool.

That being said, NiFi does have the ability to extract from relational 
databases...

The GenerateTableFetch processor [1] would likely be what you want for more of 
a bulk extraction, and QueryDatabaseTable [2] for incremental fetching.

I believe the "Maximum Value Columns" property on QueryDatabaseTable is how you 
find the new rows since the last execution.
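As a rough sketch (property names from memory, so double-check them against the 
docs in [2]; the connection pool, table, and column names below are placeholders 
for your own), the relevant QueryDatabaseTable configuration would look 
something like:

    Database Connection Pooling Service : <your DBCPConnectionPool service>
    Table Name                          : myTable
    Maximum-value Columns               : lastmodificationdate

With that in place, the processor remembers the largest lastmodificationdate it 
has seen and only returns newer rows on subsequent runs.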

Thanks,

Bryan

[1] 
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.GenerateTableFetch/index.html
[2] 
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html


On Wed, Nov 9, 2016 at 4:37 AM, Provenzano Nicolas 
<nicolas.provenz...@gfi.fr> wrote:
Hi all,

I have the following requirements:


• I need to load a full SQL table on day 1,

• and then I need to incrementally load new data (using a change data 
capture mechanism).

Initially, I was thinking of using Sqoop to do it.

Looking at NiFi, and especially at the QueryDatabaseTable processor, I’m wondering 
if I could use NiFi instead.

Has anyone already compared the two for this, and what were the outcomes?

I can’t see, however, how to configure QueryDatabaseTable to pick up only the new 
rows (for example, by looking at a “lastmodificationdate” field and taking only 
the rows for which lastModificationDate > lastRequestDate).

Thanks in advance

BR

Nicolas



RE: Nifi vs Sqoop

2016-11-10 Thread Provenzano Nicolas
Hi Matt, 

It fully answers my question. 

Thanks and regards,

Nicolas

-----Original Message-----
From: Matt Burgess [mailto:mattyb...@apache.org]
Sent: Thursday, November 10, 2016 15:32
To: users@nifi.apache.org
Subject: Re: Nifi vs Sqoop

Nicolas,

The Max Value Columns property of QueryDatabaseTable is the mechanism by which 
the processor fetches only the new rows. In your case you would put 
"lastmodificationdate" as the Max Value Column. The first time the processor is 
triggered, it will execute a "SELECT * FROM myTable" and get all the rows (as 
it does not yet know about "new" vs. "old" rows). It will then keep track of the 
maximum value observed so far for the Max Value Column.
The next time the processor is triggered, it will execute a "SELECT * FROM 
myTable WHERE lastModificationDate > the_max_value_seen_so_far".
Thus only rows whose value for the Max Value Column is greater than the current 
maximum will be returned. The maximum is then updated again, and so on.
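
To make that concrete, the effective queries would look roughly like this (the 
<stored_maximum> below is just a placeholder for whatever value the processor 
kept in its state after the previous run):

    -- First trigger: no state yet, so fetch everything
    SELECT * FROM myTable;

    -- Later triggers: only rows newer than the stored maximum
    SELECT * FROM myTable
    WHERE lastModificationDate > <stored_maximum>;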

Does this answer your question (about QueryDatabaseTable)? If not, please let me 
know.

If your source table is large and/or you'd like to parallelize the fetching of 
rows from the table, consider the GenerateTableFetch processor [1] instead. 
Rather than _executing_ SQL like QueryDatabaseTable does, GenerateTableFetch 
_generates_ SQL, and will generate a number of flow files, each containing a 
SQL statement that grabs X rows from the table. If you supply a Max Value 
Column here, it too will perform incremental fetches after the initial one. These 
flow files can be distributed throughout your cluster (using a 
RemoteProcessGroup pointing to the same cluster, and an Input Port to receive 
the flow files), creating a parallel distributed fetch capability like Sqoop. 
From a scaling perspective, Sqoop uses MapReduce so it can scale with the size 
of your Hadoop cluster.
GenerateTableFetch can scale to the size of your NiFi cluster. You might choose 
NiFi or Sqoop based on the volume and velocity of your data.
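
As a rough sketch (assuming a Partition Size of 10000 and the Generic database 
type; the exact paging syntax GenerateTableFetch emits depends on the Database 
Type you select), each generated flow file would carry a statement along the 
lines of:

    -- One flow file per partition; later ones use OFFSET 10000, 20000, ...
    SELECT * FROM myTable
    WHERE lastModificationDate <= <maximum observed at generation time>
    ORDER BY lastModificationDate
    LIMIT 10000 OFFSET 0;

Each of these can then be executed downstream (for example with ExecuteSQL) on 
whichever node receives the flow file.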

Regards,
Matt

[1] 
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.GenerateTableFetch/index.html
