Re: how to merge small parqut files in the hudi location

2019-04-07 Thread Vinoth Chandar
Hi Rahul,

It definitely seems like the number of commits to retain is not getting
passed in correctly for KEEP_LATEST_COMMITS. The changes you describe
should not affect this; it is a pure Hudi-level config.

Looking forward to the log.

Thanks
Vinoth

On Fri, Apr 5, 2019 at 1:22 AM [email protected] <[email protected]> wrote:

> On 2019/03/13 14:22:34, Vinoth Chandar wrote:
> > Another quick check. Are all 180 files part of the same file group, i.e.
> > do they begin with the same uuid prefix in their names?
> >
> > On Wed, Mar 13, 2019 at 7:14 AM Vinoth Chandar wrote:
> >
> > > Hi Rahul,
> > >
> > > From the timeline, it does seem like cleaning happens regularly. Can
> > > you share the logs from the driver in a gist?
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Wed, Mar 13, 2019 at 5:58 AM [email protected] wrote:
> > >
> > > > On 2019/03/13 08:42:13, Vinoth Chandar wrote:
> > > > > Hi Rahul,
> > > > >
> > > > > Good to know. Yes, for copy_on_write please turn off inline
> > > > > compaction. (That probably explains why the default was false.)
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > > > On Wed, Mar 13, 2019 at 12:51 AM [email protected] wrote:
> > > > >
> > > > > > On 2019/03/12 23:04:43, Vinoth Chandar wrote:
> > > > > > > Opened up https://github.com/uber/hudi/pull/599/files to
> > > > > > > improve this out of the box.
> > > > > > >
> > > > > > > On Tue, Mar 12, 2019 at 1:27 PM Vinoth Chandar wrote:
> > > > > > >
> > > > > > > > Hi Rahul,
> > > > > > > >
> > > > > > > > The files you shared all belong to the same file group
> > > > > > > > (they share the same prefix, if you notice) (
> > > > > > > > https://hudi.apache.org/concepts.html#terminologies ).
> > > > > > > > Given it's not creating new file groups every run, the
> > > > > > > > feature is kicking in.
> > > > > > > >
> > > > > > > > During each insert, Hudi will find the latest file in each
> > > > > > > > file group (i.e. the one with the largest instant time, or
> > > > > > > > timestamp) and rewrite/expand that with the new inserts.
> > > > > > > > Hudi does not clean up the old files immediately, since
> > > > > > > > that can cause running queries to fail; they could have
> > > > > > > > started even hours ago (e.g. Hive).
> > > > > > > >
> > > > > > > > If you want to reduce the number of files you see, you can
> > > > > > > > lower the number of commits retained:
> > > > > > > > https://hudi.apache.org/configurations.html#retainCommits
> > > > > > > > We retain 24 by default, i.e. after the 25th file, the
> > > > > > > > first one will be automatically cleaned.
> > > > > > > >
> > > > > > > > Does that make sense? Are you able to query this data and
> > > > > > > > find the expected records?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Vinoth
> > > > > > > >
> > > > > > > > On Tue, Mar 12, 2019 at 12:23 PM [email protected] wrote:
> > > > > > > >
> > > > > > > > > On 2019/03/11 18:25:46, Vinoth Chandar wrote:
> > > > > > > > > > Hi Rahul,
> > > > > > > > > >
> > > > > > > > > > Hudi copy-on-write storage would keep expanding your
> > > > > > > > > > existing parquet files to reach the configured file
> > > > > > > > > > size, once you set the small file size config.
> > > > > > > > > >
> > > > > > > > > > For example, we at Uber write 1GB files this way. To
> > > > > > > > > > do that, you could set something like this:
> > > > > > > > > > http://hudi.apache.org/configurations.html#limitFileSize
> > > > > > > > > > = 1 * 1024 * 1024 * 1024
> > > > > > > > > > http://hudi.apache.org/configurations.html#compactionSmallFileSize
> > > > > > > > > > = 900 * 1024 * 1024
> > > > > > > > > >
> > > > > > > > > > Please let me know if you have trouble achieving this.
> > > > > > > > > > Also, please use the insert operation (not bulk_insert)
> > > > > > > > > > for this to work.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Vinoth
> > > > > > > > > >
> > > > > > > > > > On Mon, Mar 11, 2019 at 12:32 AM [email protected] wrote:
> > > > > > > > > >
> > > > > > > > > > > On 2019/03/08 13:43:52, Vinoth Chandar wrote:
> > > > > > > > > > > > Hi Rahul,
> > > > > > > > > > > >
> > > > > > > > > > > > You can try adding
> > > > > > > > > > > > hoodie.parquet.small.file.limit=104857600 to your
> > > > > > > > > > > > property file to specify 100MB files. Note that
> > > > > > > > > > > > this works only if you are using the insert (not
> > > > > > > > > > > > bulk_insert) operation. Hudi will enforce file
> > > > > > > > > > > > sizing at ingest time. As of now, there is no
> > > > > > > > > > > > support for collapsing these file groups (parquet
> > > > > > > > > > > > + related log files) into a large file group
> > > > > > > > > > > > (a HIP/design may come soon). Does that help?
> > > > > > > > > > > >
> > > > > > > > > > > > Also, on compaction in general: since you don't
> > > > > > > > > > > > have any updates, I think you can simply use the
> > > > > > > > > > > > copy_on_write storage? Inserts will go to the
> > > > > > > > > > > > parquet file anyway on MOR (but if you would like
> > > > > > > > > > > > to be able to deal with updates later, I
> > > > > > > > > > > > understand where you are going).
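[Editor's note: the size thresholds discussed in the thread are plain byte counts. Below is a minimal Python sketch that computes them and bundles them as write options. The property keys hoodie.parquet.max.file.size and hoodie.datasource.write.operation are assumptions (taken to correspond to the limitFileSize anchor on the configuration page and the Spark datasource writer, respectively); only hoodie.parquet.small.file.limit appears verbatim in the thread. Verify all keys against your Hudi version.]

```python
# Byte values for the file-sizing knobs discussed in the thread.
GB = 1024 * 1024 * 1024
MB = 1024 * 1024

# Assumed property keys; check against your Hudi version's docs.
hudi_sizing_opts = {
    # limitFileSize: target max base-file size (1 GB, as used at Uber)
    "hoodie.parquet.max.file.size": str(1 * GB),
    # compactionSmallFileSize: files below this keep getting expanded
    "hoodie.parquet.small.file.limit": str(900 * MB),
    # file sizing is enforced only for "insert", not "bulk_insert"
    "hoodie.datasource.write.operation": "insert",
}

print(hudi_sizing_opts["hoodie.parquet.max.file.size"])    # 1073741824
print(hudi_sizing_opts["hoodie.parquet.small.file.limit"]) # 943718400
```

The 100MB example quoted above (hoodie.parquet.small.file.limit=104857600) is the same idea with 100 * 1024 * 1024.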

Re: how to merge small parqut files in the hudi location

2019-04-05 Thread rahuledavalath



Re: how to merge small parqut files in the hudi location

2019-04-04 Thread nishith agarwal
Rahul,

Please make sure you are also setting the following config:

"hoodie.cleaner.policy" -> This config supports two policies:
KEEP_LATEST_FILE_VERSIONS and
KEEP_LATEST_COMMITS (the default).

If you want to clean based on the latest file versions, set the
policy to KEEP_LATEST_FILE_VERSIONS.

-Nishith
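
[Editor's note: putting the cleaner policy above together with the commit-retention knob discussed earlier in the thread, a hypothetical property sketch might look like the following. The key names hoodie.cleaner.commits.retained and hoodie.cleaner.fileversions.retained are assumptions based on Hudi's commonly documented config names; only hoodie.cleaner.policy and the policy values appear in the message itself.]

```python
# Cleaner configuration sketch. KEEP_LATEST_COMMITS is the default policy;
# the retained-commit count only applies under that policy.
commit_based = {
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    # retainCommits: 24 by default, i.e. once a 25th version of a file
    # exists, the 1st is cleaned
    "hoodie.cleaner.commits.retained": "24",
}

# Version-count-based cleaning instead (the value 3 is purely illustrative):
version_based = {
    "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    "hoodie.cleaner.fileversions.retained": "3",
}
```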



Re: how to merge small parqut files in the hudi location

2019-04-04 Thread Vinoth Chandar
Hi rahul,

Can you paste logs related to HoodieCleaner? That could give us clues

Thanks
Vinoth


Re: how to merge small parqut files in the hudi location

2019-04-03 Thread rahuledavalath




Re: how to merge small parqut files in the hudi location

2019-04-03 Thread Vinoth Chandar
Hi Rahul,

Sorry, not following fully. Are you saying cleaning is not triggered at
all, or that the cleaner is not reclaiming older files? This definitely
should be working, so it's most likely a config issue.

Thanks
Vinoth
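
[Editor's note: one quick way to distinguish the two cases asked about here is to look at the table's .hoodie timeline folder for completed clean instants. The sketch below assumes Hudi's convention of recording completed cleans as files with a .clean suffix in the timeline; verify against your version's timeline layout.]

```python
import os

def clean_instants(table_path):
    """List completed cleaner instants recorded in the Hudi timeline."""
    timeline = os.path.join(table_path, ".hoodie")
    return sorted(f for f in os.listdir(timeline) if f.endswith(".clean"))

# No .clean files at all -> cleaning is never triggered (config issue).
# .clean files present but old parquet files remain -> the cleaner runs,
# but retention settings keep more versions than expected.
```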


Re: how to merge small parqut files in the hudi location

2019-04-03 Thread rahuledavalath




Re: how to merge small parquet files in the hudi location

2019-03-13 Thread Vinoth Chandar
Another quick check: are all 180 files part of the same file group, i.e. do
they begin with the same UUID prefix in their names?
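
For reference, the check above can be done mechanically: the file-group id is the shared prefix of the data-file name. A minimal sketch, using hypothetical placeholder names (real Hudi file names carry a UUID as the leading component before the first underscore):

```python
from collections import defaultdict

# Hypothetical data-file names for illustration only: the prefix before the
# first "_" stands in for the file-group id (a UUID in real Hudi datasets).
names = [
    "aaaa_1-0-1_20190312.parquet",
    "aaaa_1-0-1_20190313.parquet",
    "bbbb_1-0-1_20190313.parquet",
]

groups = defaultdict(list)
for n in names:
    groups[n.split("_")[0]].append(n)

# One file group -> several file versions, one per retained commit.
print(len(groups))          # 2 distinct file groups
print(len(groups["aaaa"]))  # 2 versions of the same file group
```

If all 180 files land in one bucket here, they are versions of a single file group rather than 180 independent files.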


Re: how to merge small parquet files in the hudi location

2019-03-13 Thread Vinoth Chandar
Hi Rahul,

From the timeline, it does seem like cleaning happens regularly. Can you
share the logs from the driver in a gist?

Thanks
Vinoth


Re: how to merge small parquet files in the hudi location

2019-03-13 Thread rahuledavalath




Re: how to merge small parquet files in the hudi location

2019-03-13 Thread Vinoth Chandar
Hi Rahul,

Good to know. Yes, for copy_on_write, please turn off inline compaction.
(That probably explains why the default was false.)

Thanks
Vinoth
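
For anyone following the thread, disabling inline compaction is a writer-config change. A sketch of the relevant property, assuming the 0.4.x-era key name (verify against the configurations page for your Hudi version):

```properties
# Assumed key name for this era of Hudi. Compaction only applies to
# MERGE_ON_READ storage, so for COPY_ON_WRITE it should stay disabled.
hoodie.compact.inline=false
```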


Re: how to merge small parquet files in the hudi location

2019-03-13 Thread rahuledavalath




Re: how to merge small parquet files in the hudi location

2019-03-12 Thread Vinoth Chandar
Opened up https://github.com/uber/hudi/pull/599/files to improve this
out of the box.

On Tue, Mar 12, 2019 at 1:27 PM Vinoth Chandar  wrote:

> Hi Rahul,
>
> The files you shared all belong to the same file group (they share the same
> prefix, if you notice) (https://hudi.apache.org/concepts.html#terminologies).
> Given it's not creating new file groups every run, the feature is
> kicking in.
>
> During each insert, Hudi will find the latest file in each file group (i.e.
> the one with the largest instant time, or timestamp) and rewrite/expand that
> with the new inserts. Hudi does not clean up the old files immediately, since
> that can cause running queries to fail; they could have started even
> hours ago (e.g. Hive).
>
> If you want to reduce the number of files you see, you can lower the number of
> commits retained:
> https://hudi.apache.org/configurations.html#retainCommits
> We retain 24 by default, i.e. after the 25th file, the first one will be
> automatically cleaned.
>
> Does that make sense? Are you able to query this data and find the
> expected records?
>
> Thanks
> Vinoth
>
> On Tue, Mar 12, 2019 at 12:23 PM [email protected] <
> [email protected]> wrote:
>
>>
>>
>> On 2019/03/11 18:25:46, Vinoth Chandar  wrote:
>> > Hi Rahul,
>> >
>> > Hudi/copy-on-write storage would keep expanding your existing parquet
>> > files to reach the configured file size, once you set the small-file-size
>> > config.
>> >
>> > For example, we at Uber write 1GB files this way. To do that, you could
>> > set something like this:
>> > http://hudi.apache.org/configurations.html#limitFileSize = 1 * 1024 * 1024 * 1024
>> > http://hudi.apache.org/configurations.html#compactionSmallFileSize = 900 * 1024 * 1024
>> >
>> > Please let me know if you have trouble achieving this. Also, please use
>> > the insert operation (not bulk_insert) for this to work.
>> >
>> >
>> > Thanks
>> > Vinoth
>> >
>> > On Mon, Mar 11, 2019 at 12:32 AM [email protected] <
>> > [email protected]> wrote:
>> >
>> > >
>> > >
>> > > On 2019/03/08 13:43:52, Vinoth Chandar  wrote:
>> > > > Hi Rahul,
>> > > >
>> > > > You can try adding hoodie.parquet.small.file.limit=104857600 to
>> > > > your property file to specify 100MB files. Note that this works only
>> > > > if you are using the insert (not bulk_insert) operation. Hudi will
>> > > > enforce file sizing at ingest time. As of now, there is no support
>> > > > for collapsing these file groups (parquet + related log files) into
>> > > > a larger file group (a HIP/design may come soon). Does that help?
>> > > >
>> > > > Also, on compaction in general: since you don't have any updates,
>> > > > I think you can simply use the copy_on_write storage. Inserts will
>> > > > go to the parquet file anyway on MOR (but if you would like to be
>> > > > able to deal with updates later, I understand where you are going).
>> > > >
>> > > > Thanks
>> > > > Vinoth
>> > > >
>> > > > On Fri, Mar 8, 2019 at 3:25 AM [email protected] <
>> > > > [email protected]> wrote:
>> > > >
>> > > > > Dear All
>> > > > >
>> > > > > I am using DeltaStreamer to stream the data from a Kafka topic and
>> > > > > write it into the Hudi data set.
>> > > > > For this use case I am not doing any upserts; all are inserts only,
>> > > > > so each job creates a new parquet file after the ingest job. So a
>> > > > > large number of small files are being created. How can I merge these
>> > > > > files from the DeltaStreamer job using the available configurations?
>> > > > >
>> > > > > I think compactionSmallFileSize may be useful for this case, but I
>> > > > > am not sure whether it applies to DeltaStreamer or not. I tried it
>> > > > > in DeltaStreamer but it didn't work. Please assist with this. If
>> > > > > possible, give one example for the same.
>> > > > >
>> > > > > Thanks & Regards
>> > > > > Rahul
>> > > > >
>> > > >
>> > >
>> > >
>> > > Dear Vinoth
>> > >
>> > > For one of my use case , I doing only inserts.For testing i am
>> inserting
>> > > data which have 5-10 records only. I  am continuously pushing data to
>> hudi
>> > > dataset. As it is insert only for every insert it's creating  new
>> small
>> > > files to the dataset.
>> > >
>> > > If my insertion interval is less and i am planning for data to keep
>> for
>> > > years, this flow will create lots of small files.
>> > > I just want to know whether hudi can merge these small files in any
>> ways.
>> > >
>> > >
>> > > Thanks & Regards
>> > > Rahul P
>> > >
>> > >
>> >
>>
>> Dear Vinoth
>>
>> I tried below configurations.
>>
>> hoodie.parquet.max.file.size=1073741824
>> hoodie.parquet.small.file.limit=943718400
>>
>> I am using below code for inserting data from json kafka source.
>>
>> spark-submit --class
>> com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
>> hoodie-utilities-0.4.5.jar --storage-type COPY_ON_WRITE --source-class
>> com.uber.hoodie.utilities.source

Re: how to merge small parqut files in the hudi location

2019-03-12 Thread Vinoth Chandar
Hi Rahul,

The files you shared all belong to the same file group (they share the same
prefix, if you notice; see
https://hudi.apache.org/concepts.html#terminologies).
Given it's not creating new file groups every run, it means the feature is
kicking in.

During each insert, Hudi will find the latest file in each file group (i.e.
the one with the largest instant time, or timestamp) and rewrite/expand that
with the new inserts. Hudi does not clean up the old files immediately,
since that could cause running queries to fail; they may have started even
hours ago (e.g. long-running Hive queries).
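This grouping can be sketched programmatically. The snippet below is an
illustration only (not Hudi code); it assumes the
`<fileId>_<writeToken>_<instantTime>.parquet` naming visible in the file
listing further down the thread:

```python
import re
from collections import defaultdict

# Sketch: group data files by file group (the fileId prefix) and find the
# latest instant time per group. Assumes the
# <fileId>_<writeToken>_<instantTime>.parquet naming seen in this thread.
NAME_RE = re.compile(r"^([0-9a-f\-]+)_(\d+)_(\d+)\.parquet$")

def latest_per_file_group(names):
    groups = defaultdict(list)
    for name in names:
        m = NAME_RE.match(name)
        if m:
            file_id, _write_token, instant = m.groups()
            groups[file_id].append(instant)
    # Instant times are fixed-width timestamps, so lexicographic max works.
    return {fid: max(instants) for fid, instants in groups.items()}

names = [
    "1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002655.parquet",
    "1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003045.parquet",
]
# One file group; queries serve only the latest instant within each group.
print(latest_per_file_group(names))
```

Running this on the two sample names prints a single file group mapped to
its latest instant time, which is why queries see one logical file even
though several physical versions exist on disk.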

If you want to reduce the number of files you see, you can lower the number
of commits retained:
https://hudi.apache.org/configurations.html#retainCommits
We retain 24 by default, i.e. after the 25th file, the first one will be
automatically cleaned.
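As a concrete sketch, the cleaner setting could be lowered in the property
file passed to deltastreamer (the key name is taken from the configurations
page linked above; the value 10 is just an example):

```properties
# Retain only the last 10 commits; file versions older than that become
# candidates for automatic cleaning.
hoodie.cleaner.commits.retained=10
```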

Does that make sense? Are you able to query this data and find the expected
records?

Thanks
Vinoth

On Tue, Mar 12, 2019 at 12:23 PM [email protected] <
[email protected]> wrote:

>
>
> On 2019/03/11 18:25:46, Vinoth Chandar  wrote:
> > Hi Rahul,
> >
> > Hudi/Copy-on-write storage would keep expanding your existing parquet
> files
> > to reach the configured file size, once you set the small file size
> > config..
> >
> > For e.g: we at uber, write 1GB files this way.. to do that, you could set
> > something like this.
> > http://hudi.apache.org/configurations.html#limitFileSize  = 1 * 1024 *
> 1024
> > * 1024
> > http://hudi.apache.org/configurations.html#compactionSmallFileSize =
> 900 *
> > 1024 * 1024
> >
> >
> > Please let me know if you have trouble achieving this. Also please use
> the
> > insert operation (not bulk_insert) for this to work
> >
> >
> > Thanks
> > Vinoth
> >
> > On Mon, Mar 11, 2019 at 12:32 AM [email protected] <
> > [email protected]> wrote:
> >
> > >
> > >
> > > On 2019/03/08 13:43:52, Vinoth Chandar  wrote:
> > > > Hi Rahul,
> > > >
> > > > you can try adding hoodie.parquet.small.file.limit=104857600, to your
> > > > property file to specify 100MB files. Note that this works only if
> you
> > > are
> > > > using insert (not bulk_insert) operation. Hudi will enforce file
> sizing
> > > on
> > > > ingest time. As of now, there is no support for collapsing these file
> > > > groups (parquet + related log files) into a large file group
> (HIP/Design
> > > > may come soon). Does that help?
> > > >
> > > > Also on the compaction in general, since you don't have any updates.
> > > > I think you can simply use the copy_on_write storage? inserts will
> go to
> > > > the parquet file anyway on MOR..(but if you like to be able to deal
> with
> > > > updates later, understand where you are going)
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Fri, Mar 8, 2019 at 3:25 AM [email protected] <
> > > > [email protected]> wrote:
> > > >
> > > > > Dear All
> > > > >
> > > > > I am using DeltaStreamer to stream the data from kafka topic and to
> > > write
> > > > > it into the hudi data set.
> > > > > For this use case I am not doing any upsert all are insert only so
> each
> > > > > job creates new parquet file after the inject job. So  large
> number of
> > > > > small files are creating. how can i  merge these files from
> > > deltastreamer
> > > > > job using the available configurations.
> > > > >
> > > > > I think compactionSmallFileSize may useful for this case,  but i
> am not
> > > > > sure whether it is for deltastreamer or not. I tried it in
> > > deltastreamer
> > > > > but it did't worked. Please assist on this. If possible give one
> > > example
> > > > > for the same
> > > > >
> > > > > Thanks & Regards
> > > > > Rahul
> > > > >
> > > >
> > >
> > >
> > > Dear Vinoth
> > >
> > > For one of my use case , I doing only inserts.For testing i am
> inserting
> > > data which have 5-10 records only. I  am continuously pushing data to
> hudi
> > > dataset. As it is insert only for every insert it's creating  new small
> > > files to the dataset.
> > >
> > > If my insertion interval is less and i am planning for data to keep for
> > > years, this flow will create lots of small files.
> > > I just want to know whether hudi can merge these small files in any
> ways.
> > >
> > >
> > > Thanks & Regards
> > > Rahul P
> > >
> > >
> >
>
> Dear Vinoth
>
> I tried below configurations.
>
> hoodie.parquet.max.file.size=1073741824
> hoodie.parquet.small.file.limit=943718400
>
> I am using below code for inserting data from json kafka source.
>
> spark-submit --class
> com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
> hoodie-utilities-0.4.5.jar --storage-type COPY_ON_WRITE --source-class
> com.uber.hoodie.utilities.sources.JsonKafkaSource  --source-ordering-field
> stype  --target-base-path /MERGE --target-table MERGE --props
> /hudi/kafka-source.properties  --schemaprovider-class
> com.uber.hoodie.utilities.schema.FilebasedSchemaProvider --op insert
>
> But for each insert job it's creating new parquet file. It's not touching
> old parquet files.

Re: how to merge small parqut files in the hudi location

2019-03-12 Thread rahuledavalath



On 2019/03/11 18:25:46, Vinoth Chandar  wrote: 
> Hi Rahul,
> 
> Hudi/Copy-on-write storage would keep expanding your existing parquet files
> to reach the configured file size, once you set the small file size
> config..
> 
> For e.g: we at uber, write 1GB files this way.. to do that, you could set
> something like this.
> http://hudi.apache.org/configurations.html#limitFileSize  = 1 * 1024 * 1024
> * 1024
> http://hudi.apache.org/configurations.html#compactionSmallFileSize = 900 *
> 1024 * 1024
> 
> 
> Please let me know if you have trouble achieving this. Also please use the
> insert operation (not bulk_insert) for this to work
> 
> 
> Thanks
> Vinoth
> 
> On Mon, Mar 11, 2019 at 12:32 AM [email protected] <
> [email protected]> wrote:
> 
> >
> >
> > On 2019/03/08 13:43:52, Vinoth Chandar  wrote:
> > > Hi Rahul,
> > >
> > > you can try adding hoodie.parquet.small.file.limit=104857600, to your
> > > property file to specify 100MB files. Note that this works only if you
> > are
> > > using insert (not bulk_insert) operation. Hudi will enforce file sizing
> > on
> > > ingest time. As of now, there is no support for collapsing these file
> > > groups (parquet + related log files) into a large file group (HIP/Design
> > > may come soon). Does that help?
> > >
> > > Also on the compaction in general, since you don't have any updates.
> > > I think you can simply use the copy_on_write storage? inserts will go to
> > > the parquet file anyway on MOR..(but if you like to be able to deal with
> > > updates later, understand where you are going)
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Fri, Mar 8, 2019 at 3:25 AM [email protected] <
> > > [email protected]> wrote:
> > >
> > > > Dear All
> > > >
> > > > I am using DeltaStreamer to stream the data from kafka topic and to
> > write
> > > > it into the hudi data set.
> > > > For this use case I am not doing any upsert all are insert only so each
> > > > job creates new parquet file after the inject job. So  large number of
> > > > small files are creating. how can i  merge these files from
> > deltastreamer
> > > > job using the available configurations.
> > > >
> > > > I think compactionSmallFileSize may useful for this case,  but i am not
> > > > sure whether it is for deltastreamer or not. I tried it in
> > deltastreamer
> > > > but it did't worked. Please assist on this. If possible give one
> > example
> > > > for the same
> > > >
> > > > Thanks & Regards
> > > > Rahul
> > > >
> > >
> >
> >
> > Dear Vinoth
> >
> > For one of my use case , I doing only inserts.For testing i am inserting
> > data which have 5-10 records only. I  am continuously pushing data to hudi
> > dataset. As it is insert only for every insert it's creating  new small
> > files to the dataset.
> >
> > If my insertion interval is less and i am planning for data to keep for
> > years, this flow will create lots of small files.
> > I just want to know whether hudi can merge these small files in any ways.
> >
> >
> > Thanks & Regards
> > Rahul P
> >
> >
> 

Dear Vinoth

> I tried the below configurations.

hoodie.parquet.max.file.size=1073741824
hoodie.parquet.small.file.limit=943718400

> I am using the below command to insert data from the JSON Kafka source.

spark-submit --class 
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer 
hoodie-utilities-0.4.5.jar --storage-type COPY_ON_WRITE --source-class 
com.uber.hoodie.utilities.sources.JsonKafkaSource  --source-ordering-field 
stype  --target-base-path /MERGE --target-table MERGE --props 
/hudi/kafka-source.properties  --schemaprovider-class 
com.uber.hoodie.utilities.schema.FilebasedSchemaProvider --op insert

But each insert job creates a new parquet file; it is not touching the old
parquet files.

For reference, I am sharing some of the parquet files of the hudi dataset
which are generated as part of the DeltaStreamer data insertion.

93  /MERGE/2019/03/06/.hoodie_partition_metadata
424.0 K  
/MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002655.parquet
424.0 K  
/MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002733.parquet
424.0 K  
/MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002754.parquet
424.0 K  
/MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002815.parquet
424.0 K  
/MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002837.parquet
424.0 K  
/MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002859.parquet
424.0 K  
/MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002921.parquet
424.0 K  
/MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002942.parquet
424.0 K  
/MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003003.parquet
424.0 K  
/MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003024.parquet
424.0 K  
/MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003045.parquet

Each job creates a new file of 424K, and none of them are being merged.

Re: how to merge small parqut files in the hudi location

2019-03-11 Thread Vinoth Chandar
Hi Rahul,

Hudi/Copy-on-write storage would keep expanding your existing parquet files
to reach the configured file size, once you set the small file size config.

For example: at Uber, we write 1GB files this way. To do that, you could set
something like this:
http://hudi.apache.org/configurations.html#limitFileSize = 1 * 1024 * 1024 * 1024
http://hudi.apache.org/configurations.html#compactionSmallFileSize = 900 * 1024 * 1024

Please let me know if you have trouble achieving this. Also, please use the
insert operation (not bulk_insert) for this to work.
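As a quick arithmetic check (illustrative only), those expressions evaluate
to the byte values that appear in the property files later in the thread:

```python
# 1 GB target file size and 900 MB small-file threshold, in bytes.
limit_file_size = 1 * 1024 * 1024 * 1024
small_file_size = 900 * 1024 * 1024
print(limit_file_size)   # 1073741824
print(small_file_size)   # 943718400
```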


Thanks
Vinoth

On Mon, Mar 11, 2019 at 12:32 AM [email protected] <
[email protected]> wrote:

>
>
> On 2019/03/08 13:43:52, Vinoth Chandar  wrote:
> > Hi Rahul,
> >
> > you can try adding hoodie.parquet.small.file.limit=104857600, to your
> > property file to specify 100MB files. Note that this works only if you
> are
> > using insert (not bulk_insert) operation. Hudi will enforce file sizing
> on
> > ingest time. As of now, there is no support for collapsing these file
> > groups (parquet + related log files) into a large file group (HIP/Design
> > may come soon). Does that help?
> >
> > Also on the compaction in general, since you don't have any updates.
> > I think you can simply use the copy_on_write storage? inserts will go to
> > the parquet file anyway on MOR..(but if you like to be able to deal with
> > updates later, understand where you are going)
> >
> > Thanks
> > Vinoth
> >
> > On Fri, Mar 8, 2019 at 3:25 AM [email protected] <
> > [email protected]> wrote:
> >
> > > Dear All
> > >
> > > I am using DeltaStreamer to stream the data from kafka topic and to
> write
> > > it into the hudi data set.
> > > For this use case I am not doing any upsert all are insert only so each
> > > job creates new parquet file after the inject job. So  large number of
> > > small files are creating. how can i  merge these files from
> deltastreamer
> > > job using the available configurations.
> > >
> > > I think compactionSmallFileSize may useful for this case,  but i am not
> > > sure whether it is for deltastreamer or not. I tried it in
> deltastreamer
> > > but it did't worked. Please assist on this. If possible give one
> example
> > > for the same
> > >
> > > Thanks & Regards
> > > Rahul
> > >
> >
>
>
> Dear Vinoth
>
> For one of my use case , I doing only inserts.For testing i am inserting
> data which have 5-10 records only. I  am continuously pushing data to hudi
> dataset. As it is insert only for every insert it's creating  new small
> files to the dataset.
>
> If my insertion interval is less and i am planning for data to keep for
> years, this flow will create lots of small files.
> I just want to know whether hudi can merge these small files in any ways.
>
>
> Thanks & Regards
> Rahul P
>
>


Re: how to merge small parqut files in the hudi location

2019-03-11 Thread rahuledavalath



On 2019/03/08 13:43:52, Vinoth Chandar  wrote: 
> Hi Rahul,
> 
> you can try adding hoodie.parquet.small.file.limit=104857600, to your
> property file to specify 100MB files. Note that this works only if you are
> using insert (not bulk_insert) operation. Hudi will enforce file sizing on
> ingest time. As of now, there is no support for collapsing these file
> groups (parquet + related log files) into a large file group (HIP/Design
> may come soon). Does that help?
> 
> Also on the compaction in general, since you don't have any updates.
> I think you can simply use the copy_on_write storage? inserts will go to
> the parquet file anyway on MOR..(but if you like to be able to deal with
> updates later, understand where you are going)
> 
> Thanks
> Vinoth
> 
> On Fri, Mar 8, 2019 at 3:25 AM [email protected] <
> [email protected]> wrote:
> 
> > Dear All
> >
> > I am using DeltaStreamer to stream the data from kafka topic and to write
> > it into the hudi data set.
> > For this use case I am not doing any upsert all are insert only so each
> > job creates new parquet file after the inject job. So  large number of
> > small files are creating. how can i  merge these files from deltastreamer
> > job using the available configurations.
> >
> > I think compactionSmallFileSize may useful for this case,  but i am not
> > sure whether it is for deltastreamer or not. I tried it in deltastreamer
> > but it did't worked. Please assist on this. If possible give one example
> > for the same
> >
> > Thanks & Regards
> > Rahul
> >
> 


Dear Vinoth

For one of my use cases, I am doing only inserts. For testing, I am
inserting data which has 5-10 records only, and I am continuously pushing
data to the hudi dataset. As it is insert-only, every insert creates new
small files in the dataset.

If my insertion interval is short and I am planning to keep the data for
years, this flow will create lots of small files.
I just want to know whether hudi can merge these small files in any way.


Thanks & Regards
Rahul P



Re: how to merge small parqut files in the hudi location

2019-03-08 Thread Vinoth Chandar
Hi Rahul,

you can try adding hoodie.parquet.small.file.limit=104857600 to your
property file to specify 100MB files. Note that this works only if you are
using the insert (not bulk_insert) operation. Hudi will enforce file sizing
at ingest time. As of now, there is no support for collapsing these file
groups (parquet + related log files) into a larger file group (a HIP/design
may come soon). Does that help?

Also, on compaction in general: since you don't have any updates, I think
you can simply use the copy_on_write storage. Inserts will go to the
parquet file anyway on MOR (but if you would like to be able to deal with
updates later, I understand where you are going).

Thanks
Vinoth
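A sketch of how this could look in the property file passed to
DeltaStreamer via --props (the 100MB small-file value is from this message;
the max-file-size key is the one used elsewhere in the thread):

```properties
# Expand files smaller than 100 MB on insert...
hoodie.parquet.small.file.limit=104857600
# ...up to the configured target file size (1 GB here).
hoodie.parquet.max.file.size=1073741824
```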

On Fri, Mar 8, 2019 at 3:25 AM [email protected] <
[email protected]> wrote:

> Dear All
>
> I am using DeltaStreamer to stream data from a kafka topic and write
> it into the hudi data set.
> For this use case I am not doing any upserts; all are inserts only, so each
> job creates a new parquet file after the ingest job. So a large number of
> small files are being created. How can I merge these files from the
> deltastreamer job using the available configurations?
>
> I think compactionSmallFileSize may be useful for this case, but I am not
> sure whether it applies to deltastreamer or not. I tried it in deltastreamer
> but it didn't work. Please assist on this. If possible, give one example
> for the same.
>
> Thanks & Regards
> Rahul
>