I forgot to mention, but the calculation only uses at most the first 8
characters of the string. So computation-wise you are safe, but that may
generate inaccurate splits in some scenarios. Anyway, I feel Jarcec's
solutions are a better workaround :)
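To make it concrete, here is a rough, untested sketch of that mapping
(my own simplification, not Sqoop's actual code; the class and method
names below are made up):

import java.math.BigDecimal;

public class StringSplitSketch {
    private static final BigDecimal ONE_PLACE = new BigDecimal(65536);
    private static final int MAX_CHARS = 8; // only the first 8 chars count

    // Interpret s as the fraction 0.s_0 s_1 .. s_n in base 65536,
    // using at most the first MAX_CHARS characters.
    static BigDecimal stringToDecimal(String s) {
        BigDecimal result = BigDecimal.ZERO;
        BigDecimal place = BigDecimal.ONE.divide(ONE_PLACE); // 1/65536, exact
        int len = Math.min(s.length(), MAX_CHARS);
        for (int i = 0; i < len; i++) {
            result = result.add(place.multiply(new BigDecimal(s.charAt(i))));
            place = place.divide(ONE_PLACE); // next digit position, still exact
        }
        return result;
    }

    public static void main(String[] args) {
        // "Ham" and "Haze" map to nearby decimals that differ from the
        // third character on; split points are computed between them.
        System.out.println(stringToDecimal("Ham"));
        System.out.println(stringToDecimal("Haze"));
    }
}

So even with your 72-char hash, only the first 8 characters feed the
conversion; that is why the computation is cheap but everything past the
8th character is ignored when splits are chosen.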
Thanks,
Abhijeet

On 5 Sep 2012 20:00, "Alberto Cordioli" <[email protected]> wrote:
> Mmh, I see. Hence I should avoid splitting by string fields, since my
> hash field is 72 chars long and requires a lot of computation (if I
> understood correctly).
> I think one of the solutions proposed by Jarcec could be ok.
> I also think I'll divide my big table into smaller chunks, since the
> problem is the query that determines the split points. What do you
> think?
>
> Cheers,
> Alberto
>
> On 5 September 2012 15:21, abhijeet gaikwad <[email protected]> wrote:
> > Hi Alberto,
> > Sqoop calculates split points by converting the min. and max. string
> > values returned by the DBMS for the column into decimal values, and
> > then uses the decimal splitter. A fairly complex algorithm (more of a
> > mapping) is used to convert strings to decimals. This should help you
> > understand better (taken from the Javadoc for the split method of the
> > text splitter):
> >
> > /**
> >  * This method needs to determine the splits between two user-provided
> >  * strings. In the case where the user's strings are 'A' and 'Z', this is
> >  * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
> >  * splits for strings beginning with each letter, etc.
> >  *
> >  * If a user has provided us with the strings "Ham" and "Haze", however, we
> >  * need to create splits that differ in the third letter.
> >  *
> >  * The algorithm used is as follows:
> >  * Since there are 2**16 unicode characters, we interpret characters as
> >  * digits in base 65536. Given a string 's' containing characters s_0, s_1
> >  * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2 .. s_n in
> >  * base 65536. Having mapped the low and high strings into floating-point
> >  * values, we then use the BigDecimalSplitter to establish the even split
> >  * points, then map the resulting floating point values back into strings.
> >  */
> > public List<InputSplit> split(Configuration conf, ResultSet results,
> >     String colName) throws SQLException {
> >
> > After the splits are calculated, WHERE clauses are used in the SELECT
> > queries fired by each mapper to retrieve the data (i.e. each result is
> > bounded by split points).
> >
> > From a user perspective, you can use a string for splitting except in
> > the following scenario: a char split-by column is not recommended when
> > the DBMS sorts in a case-insensitive manner. The current algorithm
> > used to calculate splits has some flaws. This is known, and Sqoop
> > displays a warning before executing the job.
> >
> > Let me know if you need more details.
> >
> > Thanks,
> > Abhijeet
> >
> > On Wed, Sep 5, 2012 at 5:13 PM, Jarek Jarcec Cecho <[email protected]> wrote:
> >> Hi Alberto,
> >> I've never used a text column for data splitting, but it seems that
> >> Sqoop supports it (I found its splitter in the code). However, I'm
> >> still not sure it's wise, as string operations tend to be much slower
> >> on databases and you might end up with performance issues.
> >> Unfortunately, Sqoop currently does not support any direct way to
> >> affect split creation.
> >>
> >> I tried to think about your problem and came up with two ideas that
> >> might help in your use case:
> >>
> >> 1) Would it be acceptable in your use case to change the zero-date
> >> policy from zeroDateTimeBehavior=round to
> >> zeroDateTimeBehavior=convertToNull? In case the "split" column
> >> contains NULLs, Sqoop will create X+1 splits, where the +1 covers all
> >> NULL values. It probably won't be the best, but it might help to
> >> distribute your load more properly.
> >>
> >> 2) What about splitting the entire job into two parts: first export
> >> all zero dates, and separately, in the next job, the rest of the
> >> values? By doing so you might be able to get a decent distribution
> >> across the "normal" dates part. Importing all the zero dates might be
> >> challenging if you have a lot of them, as there will be only one
> >> value available (and thus just one split), and therefore you might
> >> need to use the text column for split creation in this case anyway.
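> >>
> >> Concretely (just a sketch, untested, reusing the placeholders from
> >> your command): for (1) only the connect string changes,
> >>
> >> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=convertToNull
> >>
> >> and for (2) you would run the import twice with an extra condition in
> >> the query, along these lines:
> >>
> >> --query "<query> WHERE <date-field> = '0000-00-00'" --num-mappers 1
> >> --query "<query> WHERE <date-field> <> '0000-00-00'" --split-by <date-field>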
> >>
> >> Jarcec
> >>
> >> On Wed, Sep 05, 2012 at 10:16:17AM +0200, Alberto Cordioli wrote:
> >> > Thanks Jarcec,
> >> > you've probably identified the problem immediately. In fact, I
> >> > checked the date field, and I think the problem is that my data
> >> > contain some "limit" values like '0000-00-00' (damn whoever
> >> > inserted these).
> >> > The other data are equally distributed over 2 months (from
> >> > 2012-04-01 to 2012-06-01): as you said, with a parallelism of 3,
> >> > 2 mappers will take basically no data while the other will do the
> >> > "true" job, right?
> >> >
> >> > So now my question becomes: the other field that I could use to
> >> > split the job is a hash (string). How does Sqoop divide this type
> >> > of field? Lexicographic order?
> >> >
> >> > Alberto
> >> >
> >> > On 5 September 2012 09:57, Jarek Jarcec Cecho <[email protected]> wrote:
> >> > > Hi Alberto,
> >> > > taking into account that you have 910 million records and your
> >> > > job was able to get to 75% in a matter of 8 minutes and then
> >> > > slowed down significantly, I have a feeling that your splits were
> >> > > not equally divided. Based on your command line, it seems that
> >> > > you're dividing the data by some date field. Is this date field
> >> > > uniformly distributed? E.g., is there roughly the same number of
> >> > > rows for each date, or do you have more rows on more recent days?
> >> > >
> >> > > Because Sqoop has no idea how exactly the data are distributed in
> >> > > your database, it assumes a uniform distribution. Let me explain
> >> > > why it matters with the following example. Let's consider a table
> >> > > with one row on 2012-01-01, a second row on 2012-02-01, and 1M
> >> > > rows on 2012-03-01, and let's assume that we will use three
> >> > > mappers (--num-mappers 3). In this case, Sqoop will create three
> >> > > splits: 2012-01-01 up to 2012-01-31, 2012-02-01 up to 2012-02-28,
> >> > > and 2012-03-01 up to 2012-03-31. Because the first two mappers
> >> > > have just one row each to move, they will finish almost instantly
> >> > > and get the job to 66% done (2 out of 3 mappers are done), while
> >> > > the last mapper will be running for some time, as it needs to
> >> > > move 1M rows. To an external observer it would appear that Sqoop
> >> > > has stopped, but what really happened is that the data are not
> >> > > uniformly distributed across all mappers.
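> >> > >
> >> > > Schematically (a simplified sketch, not Sqoop's actual splitter
> >> > > code), the split points are computed like this, which is why they
> >> > > are evenly spaced no matter how the rows are distributed:
> >> > >
> >> > > import java.sql.Date;
> >> > >
> >> > > public class DateSplitSketch {
> >> > >     public static void main(String[] args) {
> >> > >         long min = Date.valueOf("2012-01-01").getTime();
> >> > >         long max = Date.valueOf("2012-03-31").getTime();
> >> > >         int numSplits = 3; // --num-mappers 3
> >> > >         long step = (max - min) / numSplits;
> >> > >         for (int i = 0; i < numSplits; i++) {
> >> > >             long lo = min + i * step;
> >> > >             long hi = (i == numSplits - 1) ? max : lo + step;
> >> > >             // each mapper's query is bounded by [lo, hi]
> >> > >             System.out.println(new Date(lo) + " .. " + new Date(hi));
> >> > >         }
> >> > >     }
> >> > > }
> >> > >
> >> > > The boundaries depend only on the min and max values, not on the
> >> > > row counts, so all 1M rows from 2012-03-01 land in the last split.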
> >> > >
> >> > > Jarcec
> >> > >
> >> > > On Wed, Sep 05, 2012 at 09:37:49AM +0200, Alberto Cordioli wrote:
> >> > >> Hi all,
> >> > >>
> >> > >> I am using Sqoop to import a big MySQL table (around 910 million
> >> > >> records) into HBase.
> >> > >> The command line that I'm using is something like:
> >> > >>
> >> > >> sqoop import --connect
> >> > >> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round --username
> >> > >> <usr> -P --query '<query>' --split-by <date-field> --hbase-table
> >> > >> "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"
> >> > >>
> >> > >> The strange thing is that it takes a long time to complete the
> >> > >> last part of the map. This is part of the log:
> >> > >>
> >> > >> [...]
> >> > >> 12/09/04 17:16:45 INFO mapred.JobClient: Running job: job_201209031227_0007
> >> > >> 12/09/04 17:16:46 INFO mapred.JobClient:  map 0% reduce 0%
> >> > >> 12/09/04 17:24:20 INFO mapred.JobClient:  map 25% reduce 0%
> >> > >> 12/09/04 17:24:21 INFO mapred.JobClient:  map 50% reduce 0%
> >> > >> 12/09/04 17:24:23 INFO mapred.JobClient:  map 75% reduce 0%
> >> > >>
> >> > >> As you can see, it does not take much time to go from start to
> >> > >> 75%, but the last part hasn't finished (although it has been
> >> > >> working continuously for a day).
> >> > >> Is there something wrong? I've tried to take a look at the logs,
> >> > >> but they seem to be ok.
> >> > >>
> >> > >> Thanks,
> >> > >> Alberto
> >> > >>
> >> > >> --
> >> > >> Alberto Cordioli
> >> >
> >> > --
> >> > Alberto Cordioli
>
> --
> Alberto Cordioli
