Perfect. Thanks for the help!

Cheers,
Alberto
On 5 September 2012 16:50, abhijeet gaikwad <[email protected]> wrote:
> I forgot to mention, but the calculations are done for at most the first 8 chars of
> the string. So computation-wise you are safe - but that may generate inaccurate
> splits for some scenarios.
> Anyway, I feel Jarcec's solutions are a better workaround :)
>
> Thanks,
> Abhijeet
>
> On 5 Sep 2012 20:00, "Alberto Cordioli" <[email protected]> wrote:
>>
>> Mmh, I see. Hence I should avoid splitting by string fields, since my hash
>> field is 72 chars long and it requires a lot of computation (if I understood
>> correctly).
>> I think one of the solutions proposed by Jarcec could be ok.
>> I also think I'll divide my big table into smaller chunks, since the
>> problem is the query that determines the split points. What do you think?
>>
>> Cheers,
>> Alberto
>>
>>
>> On 5 September 2012 15:21, abhijeet gaikwad <[email protected]> wrote:
>> > Hi Alberto,
>> > Sqoop calculates split points by converting the min. and max. string values
>> > returned by the DBMS for the column into decimal values, and then uses the
>> > decimal splitter. There is a complex algorithm (more of a mapping) used for
>> > converting string to decimal. This should help you understand better (taken
>> > from the Java docs for the split method of the text splitter):
>> >
>> > /**
>> >  * This method needs to determine the splits between two user-provided
>> >  * strings. In the case where the user's strings are 'A' and 'Z', this is
>> >  * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
>> >  * splits for strings beginning with each letter, etc.
>> >  *
>> >  * If a user has provided us with the strings "Ham" and "Haze", however, we
>> >  * need to create splits that differ in the third letter.
>> >  *
>> >  * The algorithm used is as follows:
>> >  * Since there are 2**16 unicode characters, we interpret characters as
>> >  * digits in base 65536. Given a string 's' containing characters s_0, s_1
>> >  * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2 .. s_n in
>> >  * base 65536. Having mapped the low and high strings into floating-point
>> >  * values, we then use the BigDecimalSplitter to establish the even split
>> >  * points, then map the resulting floating point values back into strings.
>> >  */
>> > public List<InputSplit> split(Configuration conf, ResultSet results,
>> >     String colName) throws SQLException {
>> >
>> > After the splits are calculated, WHERE clauses are used in the SELECT
>> > queries fired by each mapper (i.e. each result set is bounded by its split
>> > points) to retrieve the data.
>> >
>> > From a user perspective, you can use a string column for splitting except in
>> > the following scenario: a char split-by column is not recommended when the
>> > DBMS sorts in a case-insensitive manner. The current algorithm used to
>> > calculate splits has some flaws. This is known and Sqoop displays a warning
>> > before executing the job.
>> >
>> > Let me know if you need more details.
>> >
>> > Thanks,
>> > Abhijeet
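A minimal sketch of the mapping described in that javadoc (illustrative Java only, written against the description above rather than copied from Sqoop's actual TextSplitter source; as noted earlier in the thread, the real code also limits the conversion to roughly the first 8 characters):

import java.math.BigDecimal;

public class TextToDecimalSketch {
    // Each further character is worth 1/65536 of the previous place.
    private static final BigDecimal ONE_PLACE = new BigDecimal(65536);

    // Interpret "s_0 s_1 .. s_n" as the fraction 0.s_0 s_1 .. s_n in base 65536.
    static BigDecimal stringToBigDecimal(String str) {
        BigDecimal result = BigDecimal.ZERO;
        BigDecimal curPlace = BigDecimal.ONE.divide(ONE_PLACE); // 1/65536, exact
        for (int i = 0; i < str.length(); i++) {
            result = result.add(new BigDecimal((int) str.charAt(i)).multiply(curPlace));
            curPlace = curPlace.divide(ONE_PLACE); // exact, since 65536 = 2^16
        }
        return result;
    }

    // Map a value in [0, 1) back to a string of at most maxChars characters.
    static String bigDecimalToString(BigDecimal value, int maxChars) {
        StringBuilder sb = new StringBuilder();
        BigDecimal cur = value;
        for (int i = 0; i < maxChars; i++) {
            cur = cur.multiply(ONE_PLACE);
            int codePoint = cur.intValue();
            if (codePoint <= 0) {
                break;
            }
            sb.append((char) codePoint);
            cur = cur.subtract(new BigDecimal(codePoint));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BigDecimal low  = stringToBigDecimal("Ham");
        BigDecimal high = stringToBigDecimal("Haze");
        // A split point halfway between the two strings, mapped back to text.
        BigDecimal mid = low.add(high).divide(new BigDecimal(2));
        System.out.println(bigDecimalToString(mid, 8)); // starts with "Ha", falls between "Ham" and "Haze"
    }
}

Because only the leading characters meaningfully influence the value, long keys such as a 72-character hash are effectively split on their first few characters only.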
>> > On Wed, Sep 5, 2012 at 5:13 PM, Jarek Jarcec Cecho <[email protected]> wrote:
>> >>
>> >> Hi Alberto,
>> >> I've never used a text column for data splitting, however it seems that
>> >> Sqoop supports it (I found its splitter in the code). However, I'm still
>> >> not sure it's wise, as string operations tend to be much slower on
>> >> databases and you might end up with performance issues. Unfortunately,
>> >> Sqoop currently does not offer any direct way to affect split creation.
>> >>
>> >> I tried to think about your problem and came up with two ideas that might
>> >> help in your use case:
>> >>
>> >> 1) Would it be acceptable in your use case to change the zero date policy
>> >> from zeroDateTimeBehavior=round to zeroDateTimeBehavior=convertToNull? In
>> >> case the "split" column contains nulls, Sqoop will create X+1 splits, where
>> >> the +1 will cover all NULL values. It probably won't be the best, but it
>> >> might help to distribute your load more properly.
>> >>
>> >> 2) What about splitting the entire job into two parts - first export all
>> >> the zero dates, and then in a separate job the rest of the values? By doing
>> >> so you might be able to get a decent distribution across the "normal" dates
>> >> part. Importing all the zero dates might be challenging if you have a lot
>> >> of them, as there will be only one value available (and thus just one
>> >> split), and therefore you might need to use the text column for split
>> >> creation in this case anyway.
>> >>
>> >> Jarcec
>> >>
>> >> On Wed, Sep 05, 2012 at 10:16:17AM +0200, Alberto Cordioli wrote:
>> >> > Thanks Jarcec,
>> >> > you've probably identified the problem immediately. In fact, I checked
>> >> > the date field, and I think the problem is that in my data I have some
>> >> > "limit" values like '0000-00-00' (damn whoever inserted these).
>> >> > The other data are equally distributed over 2 months (from 2012-04-01 to
>> >> > 2012-06-01): as you said, with a parallelism of 3, 2 mappers will take
>> >> > basically no data while the other will do the "true" job, right?
>> >> >
>> >> > So, now my question becomes: the other field that I could use to split
>> >> > the job is a hash (string). How does Sqoop divide this type of field?
>> >> > Lexicographic order?
>> >> >
>> >> >
>> >> > Alberto
>> >> >
>> >> >
>> >> >
>> >> > On 5 September 2012 09:57, Jarek Jarcec Cecho <[email protected]> wrote:
>> >> > > Hi Alberto,
>> >> > > taking into account that you have 910 million records and your job was
>> >> > > able to get to 75% in a matter of 8 minutes and then slowed down
>> >> > > significantly, I do have a feeling that your splits were not equally
>> >> > > divided. Based on your command line it seems that you're dividing data
>> >> > > by some date field. Is this date field uniformly distributed? E.g. is
>> >> > > there roughly the same number of rows for each date, or do you have
>> >> > > more rows in more recent days?
>> >> > >
>> >> > > Because Sqoop has no idea how exactly the data are distributed in your
>> >> > > database, it assumes a uniform distribution. Let me explain why it
>> >> > > matters with the following example. Let's consider a table where there
>> >> > > is one row on 2012-01-01, a second row on 2012-02-01 and 1M rows on
>> >> > > 2012-03-01. Let's assume that we will use three mappers
>> >> > > (--num-mappers 3). In this case, Sqoop will create three splits:
>> >> > > 2012-01-01 up to 2012-01-31, 2012-02-01 up to 2012-02-28 and 2012-03-01
>> >> > > up to 2012-03-31. Because the first two mappers have just one row to
>> >> > > move, they will finish almost instantly and get the job to 66% done
>> >> > > (2 out of 3 mappers are done); however, the last mapper will be running
>> >> > > for some time as it needs to move 1M rows. To an external observer it
>> >> > > would appear that Sqoop has stopped, but what really happened is just
>> >> > > that the data are not uniformly distributed across all mappers.
>> >> > >
>> >> > > Jarcec
>> >> > >
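To put rough numbers on the skew Jarcec describes, here is a small illustrative sketch (plain Java, not Sqoop's actual date splitter code): a numeric splitter divides the [min, max] interval of the split column evenly, so a single '0000-00-00' outlier (rounded to 0001-01-01 by zeroDateTimeBehavior=round) drags the minimum back by two millennia and leaves all the real 2012 rows in the last split.

import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

public class SplitSkewSketch {
    // Illustrative only: divide [min, max] into numSplits even ranges, the way a
    // numeric splitter does. Dates are treated as epoch milliseconds.
    static List<long[]> evenSplits(long min, long max, int numSplits) {
        List<long[]> splits = new ArrayList<>();
        long step = (max - min) / numSplits;
        for (int i = 0; i < numSplits; i++) {
            long lo = min + i * step;
            long hi = (i == numSplits - 1) ? max : lo + step;
            splits.add(new long[] {lo, hi});
        }
        return splits;
    }

    public static void main(String[] args) {
        // zeroDateTimeBehavior=round turns '0000-00-00' into '0001-01-01 00:00:00',
        // so the column minimum becomes year 1 instead of 2012-04-01.
        long min = Timestamp.valueOf("0001-01-01 00:00:00").getTime();
        long max = Timestamp.valueOf("2012-06-01 00:00:00").getTime();
        for (long[] s : evenSplits(min, max, 3)) {
            System.out.println(new Timestamp(s[0]) + "  ..  " + new Timestamp(s[1]));
        }
        // Roughly: [year 1 .. ~671), [~671 .. ~1341), [~1341 .. 2012-06-01].
        // Only the last range contains any real rows, so only one mapper does real work.
    }
}

With the zero dates present, all of the real 2012 rows land in the last split, which matches a job that races to 75% and then sits there while one mapper moves nearly all 910 million rows.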
>> >> > > On Wed, Sep 05, 2012 at 09:37:49AM +0200, Alberto Cordioli wrote:
>> >> > >> Hi all,
>> >> > >>
>> >> > >> I am using Sqoop to import a big MySQL table (around 910 million
>> >> > >> records) into HBase.
>> >> > >> The command line that I'm using is something like:
>> >> > >> sqoop import --connect
>> >> > >> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round --username <usr>
>> >> > >> -P --query '<query>' --split-by <date-field> --hbase-table
>> >> > >> "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"
>> >> > >>
>> >> > >> The strange thing is that it takes a long time to complete the last
>> >> > >> part of the map. This is part of the log:
>> >> > >>
>> >> > >> [...]
>> >> > >> 12/09/04 17:16:45 INFO mapred.JobClient: Running job: job_201209031227_0007
>> >> > >> 12/09/04 17:16:46 INFO mapred.JobClient:  map 0% reduce 0%
>> >> > >> 12/09/04 17:24:20 INFO mapred.JobClient:  map 25% reduce 0%
>> >> > >> 12/09/04 17:24:21 INFO mapred.JobClient:  map 50% reduce 0%
>> >> > >> 12/09/04 17:24:23 INFO mapred.JobClient:  map 75% reduce 0%
>> >> > >>
>> >> > >> As you can see, it does not take much time to go from start to 75%,
>> >> > >> but the last part hasn't finished (although it has been working for a
>> >> > >> day continuously).
>> >> > >> Is there something wrong? I've tried to take a look at the logs, but
>> >> > >> it seems to be ok.
>> >> > >>
>> >> > >>
>> >> > >> Thanks,
>> >> > >> Alberto
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> --
>> >> > >> Alberto Cordioli
>> >> >
>> >> >
>> >> > --
>> >> > Alberto Cordioli
>> >
>> >
>>
>>
>> --
>> Alberto Cordioli

--
Alberto Cordioli
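For reference, Jarcec's second suggestion above might look roughly like the two invocations below. This is only a sketch built from the command already quoted in this thread: <query> stands for the original SELECT (adjust if it already has a WHERE clause), the other placeholders are unchanged, and Sqoop requires the literal $CONDITIONS token in any free-form --query.

# Job 1: everything except the zero dates; splits cover the real 2012-04..2012-06 range.
sqoop import \
  --connect "jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round" \
  --username <usr> -P \
  --query '<query> WHERE <date-field> <> "0000-00-00" AND $CONDITIONS' \
  --split-by <date-field> \
  --hbase-table "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"

# Job 2: only the zero dates. There is a single distinct value, so use a single
# mapper (or split this job by another column, e.g. the hash, accepting the
# string-splitting caveats discussed above).
sqoop import \
  --connect "jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round" \
  --username <usr> -P \
  --query '<query> WHERE <date-field> = "0000-00-00" AND $CONDITIONS' \
  --num-mappers 1 \
  --hbase-table "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"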
