Mmh, I see. So I should avoid splitting by string fields, since my hash
field is 72 characters long and the split computation would be expensive
(if I understood correctly).
I think one of the solutions proposed by Jarcec could work. I also think
I'll divide my big table into smaller chunks, since the problem is the
query that determines the split points. What do you think?
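Just to make my plan concrete, I would run several smaller imports, each
one already bounded to a date range inside the query, so that the query
computing the split points only ever touches a small slice of the table.
Something along these lines (untested, placeholders as in my original
command; note that Sqoop requires the $CONDITIONS token in free-form
queries):

sqoop import --connect 'jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round' \
  --username <usr> -P \
  --query 'SELECT ... FROM <table> WHERE <date-field> >= "2012-04-01" AND <date-field> < "2012-05-01" AND $CONDITIONS' \
  --split-by <date-field> \
  --hbase-table "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"

and then the same command again for each following date range.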

Cheers,
Alberto


On 5 September 2012 15:21, abhijeet gaikwad <[email protected]> wrote:
> Hi Alberto,
> Sqoop calculates split points by converting the min. and max. string
> values returned by the DBMS for the column into decimal values, and then
> uses the decimal splitter on those. The algorithm used for converting
> (more of a mapping) a string to a decimal is a bit involved. This should
> help you understand it better (taken from the Javadoc of the split
> method of the text splitter):
>
> /**
>    * This method needs to determine the splits between two user-provided
>    * strings.  In the case where the user's strings are 'A' and 'Z', this is
>    * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
>    * splits for strings beginning with each letter, etc.
>    *
>    * If a user has provided us with the strings "Ham" and "Haze", however,
>    * we need to create splits that differ in the third letter.
>    *
>    * The algorithm used is as follows:
>    * Since there are 2**16 unicode characters, we interpret characters as
>    * digits in base 65536. Given a string 's' containing characters s_0, s_1
>    * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
>    * base 65536. Having mapped the low and high strings into floating-point
>    * values, we then use the BigDecimalSplitter to establish the even split
>    * points, then map the resulting floating point values back into strings.
>    */
>   public List<InputSplit> split(Configuration conf, ResultSet results,
>       String colName) throws SQLException {
>
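> To make the mapping more concrete, here is a simplified, untested sketch
> of the idea (this is not the actual Sqoop code, and the class and method
> names are made up):
>
> import java.math.BigDecimal;
>
> public class TextSplitSketch {
>
>   private static final BigDecimal BASE = new BigDecimal(65536);
>
>   // Interpret the string as the fraction 0.s_0 s_1 ... s_n in base
>   // 65536, i.e. each character is one "digit" after the radix point.
>   static BigDecimal stringToDecimal(String s) {
>     BigDecimal result = BigDecimal.ZERO;
>     BigDecimal place = BigDecimal.ONE.divide(BASE); // first digit position
>     for (int i = 0; i < s.length(); i++) {
>       result = result.add(place.multiply(new BigDecimal((int) s.charAt(i))));
>       place = place.divide(BASE); // next digit position
>     }
>     return result;
>   }
>
>   public static void main(String[] args) {
>     // "Ham" and "Haze" share the prefix "Ha", so their decimal images
>     // are close together and the splits differ from the third letter on.
>     System.out.println(stringToDecimal("Ham"));
>     System.out.println(stringToDecimal("Haze"));
>   }
> }
>
> Once the low and high strings are mapped this way, evenly spaced decimal
> split points are chosen between them and mapped back into strings.
>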
> After the splits are calculated, WHERE clauses are used in the SELECT
> queries fired by each mapper to retrieve the data (i.e. each mapper's
> result set is bounded by its split points).
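> For example, with two splits on a column <split-col>, the two mappers
> would run something like this (illustrative only, not the exact SQL that
> Sqoop generates):
>
>   SELECT ... WHERE <split-col> >= <low> AND <split-col> < <mid>
>   SELECT ... WHERE <split-col> >= <mid> AND <split-col> <= <high>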
>
> From a user perspective, you can use a string column for splitting,
> except in the following scenario: a char split-by column is not
> recommended when the DBMS sorts in a case-insensitive manner. The
> current algorithm used to calculate splits has some flaws in that case;
> this is known, and Sqoop displays a warning before executing the job.
>
> Let me know if you need more details.
>
> Thanks,
> Abhijeet
>
>
> On Wed, Sep 5, 2012 at 5:13 PM, Jarek Jarcec Cecho <[email protected]>
> wrote:
>>
>> Hi Alberto,
>> I've never used a text column for data splitting, but it seems that
>> Sqoop supports it (I found its splitter in the code). However, I'm
>> still not sure it's wise, as string operations tend to be much slower
>> on databases and you might end up with performance issues.
>> Unfortunately, Sqoop currently does not offer any direct way to affect
>> split creation.
>>
>> I tried to think about your problem and came up with two ideas that
>> might help in your use case:
>>
>> 1) Would it be acceptable in your use case to change the zero-date
>> policy from zeroDateTimeBehavior=round to
>> zeroDateTimeBehavior=convertToNull? In case the "split" column contains
>> NULLs, Sqoop will create X+1 splits, where the +1 covers all NULL
>> values. It probably won't be the best distribution, but it might help
>> spread your load more evenly.
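>> For example, only the connect string would need to change:
>>
>>   jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=convertToNull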
>>
>> 2) What about splitting the entire job into two parts: first import all
>> the zero dates, and then, in a separate job, the rest of the values? By
>> doing so you might be able to get a decent distribution across the
>> "normal" dates part. Importing all the zero dates might be challenging
>> if you have a lot of them, as there will be only one value available
>> (and thus just one split), so you might need to use the text column for
>> split creation in that case anyway.
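>> In concrete terms (untested sketch), the two jobs would differ only in
>> the WHERE clause of the free-form query and in how they are
>> parallelized, roughly:
>>
>>   job 1: ... WHERE <date-field> = '0000-00-00' AND $CONDITIONS  (--num-mappers 1)
>>   job 2: ... WHERE <date-field> > '0000-00-00' AND $CONDITIONS  (--split-by <date-field>)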
>>
>> Jarcec
>>
>> On Wed, Sep 05, 2012 at 10:16:17AM +0200, Alberto Cordioli wrote:
>> > Thanks Jarcec,
>> > you've probably identified the problem immediately. In fact, I checked
>> > the date field, and I think the problem is that my data contains some
>> > "limit" values like '0000-00-00' (damn whoever inserted these).
>> > The other data are evenly distributed over 2 months (from 2012-04-01
>> > to 2012-06-01): as you said, with a parallelism of 3, 2 mappers will
>> > take basically no data while the other one does the "true" job, right?
>> >
>> > So now my question becomes: the other field that I could use to split
>> > the job is a hash (string). How does Sqoop divide this type of field?
>> > Lexicographic order?
>> >
>> >
>> > Alberto
>> >
>> >
>> >
>> > On 5 September 2012 09:57, Jarek Jarcec Cecho <[email protected]> wrote:
>> > > Hi Alberto,
>> > > taking into account that you have 910 million records and your job
>> > > was able to get to 75% in a matter of 8 minutes and then slowed down
>> > > significantly, I have a feeling that your splits were not equally
>> > > divided. Based on your command line it seems that you're dividing
>> > > the data by some date field. Is this date field uniformly
>> > > distributed? E.g. is there roughly the same number of rows for each
>> > > date, or do you have more rows on more recent days?
>> > >
>> > > Because Sqoop has no idea how exactly the data are distributed in
>> > > your database, it assumes a uniform distribution. Let me explain why
>> > > it matters with the following example. Consider a table with one row
>> > > on 2012-01-01, a second row on 2012-02-01 and 1M rows on 2012-03-01,
>> > > and assume that we use three mappers (--num-mappers 3). In this
>> > > case, Sqoop will create three splits: 2012-01-01 up to 2012-01-31,
>> > > 2012-02-01 up to 2012-02-28 and 2012-03-01 up to 2012-03-31. Because
>> > > the first two mappers have just one row each to move, they will
>> > > finish almost instantly and get the job to 66% done (2 out of 3
>> > > mappers finished), however the last mapper will be running for quite
>> > > some time as it needs to move 1M rows. To an external observer it
>> > > would appear that Sqoop has stopped, but what really happened is
>> > > that the data are not uniformly distributed across the mappers.
>> > >
>> > > Jarcec
>> > >
>> > > On Wed, Sep 05, 2012 at 09:37:49AM +0200, Alberto Cordioli wrote:
>> > >> Hi all,
>> > >>
>> > >> I am using Sqoop to import a big MySQL table (around 910 million
>> > >> records) into HBase.
>> > >> The command line that I'm using is something like:
>> > >> sqoop import --connect
>> > >> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round --username
>> > >> <usr>
>> > >> -P --query '<query>' --split-by <date-field> --hbase-table
>> > >> "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"
>> > >>
>> > >> The strange thing is that it takes a very long time to complete the
>> > >> last part of the map phase. This is part of the log:
>> > >>
>> > >> [...]
>> > >> 12/09/04 17:16:45 INFO mapred.JobClient: Running job:
>> > >> job_201209031227_0007
>> > >> 12/09/04 17:16:46 INFO mapred.JobClient:  map 0% reduce 0%
>> > >> 12/09/04 17:24:20 INFO mapred.JobClient:  map 25% reduce 0%
>> > >> 12/09/04 17:24:21 INFO mapred.JobClient:  map 50% reduce 0%
>> > >> 12/09/04 17:24:23 INFO mapred.JobClient:  map 75% reduce 0%
>> > >>
>> > >> As you can see, it does not take much time to go from start to 75%,
>> > >> but the last part has not finished yet (although the job has now
>> > >> been running continuously for a day).
>> > >> Is there something wrong? I've tried to take a look at the logs,
>> > >> but they seem to be ok.
>> > >>
>> > >>
>> > >> Thanks,
>> > >> Alberto
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> Alberto Cordioli
>> >
>> >
>> >
>> > --
>> > Alberto Cordioli
>
>



-- 
Alberto Cordioli
