I forgot to mention, but the calculation only uses at most the first 8
characters of the string. So computation-wise you are safe, but that may
generate inaccurate splits in some scenarios. Anyway, I feel Jarcec's
solutions are a better workaround :)
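To make it concrete, here is a rough, untested sketch of that mapping
(my own simplification, not Sqoop's actual code; the class and method
names below are made up):

import java.math.BigDecimal;

public class StringSplitSketch {
    private static final BigDecimal ONE_PLACE = new BigDecimal(65536);
    private static final int MAX_CHARS = 8; // only the first 8 chars count

    // Interpret s as the fraction 0.s_0 s_1 .. s_n in base 65536,
    // using at most the first MAX_CHARS characters.
    static BigDecimal stringToDecimal(String s) {
        BigDecimal result = BigDecimal.ZERO;
        BigDecimal place = BigDecimal.ONE.divide(ONE_PLACE); // 1/65536, exact
        int len = Math.min(s.length(), MAX_CHARS);
        for (int i = 0; i < len; i++) {
            result = result.add(place.multiply(new BigDecimal(s.charAt(i))));
            place = place.divide(ONE_PLACE); // next digit position, still exact
        }
        return result;
    }

    public static void main(String[] args) {
        // "Ham" and "Haze" map to nearby decimals that differ from the
        // third character on; split points are computed between them.
        System.out.println(stringToDecimal("Ham"));
        System.out.println(stringToDecimal("Haze"));
    }
}

So even with your 72-char hash, only the first 8 characters feed the
conversion; that is why the computation is cheap but everything past the
8th character is ignored when splits are chosen.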
Thanks,
Abhijeet

On 5 Sep 2012 20:00, "Alberto Cordioli" <[email protected]> wrote:
> Mmh, I see. Hence I should avoid splitting by string fields, since my
> hash field is 72 chars long and requires a lot of computation (if I
> understood correctly).
> I think one of the solutions proposed by Jarcec could be ok.
> I also think I'll divide my big table into smaller chunks, since the
> problem is the query that determines the split points. What do you
> think?
>
> Cheers,
> Alberto
>
> On 5 September 2012 15:21, abhijeet gaikwad <[email protected]> wrote:
> > Hi Alberto,
> > Sqoop calculates split points by converting the min. and max. string
> > values returned by the DBMS for the column into decimal values, and
> > then uses the decimal splitter. A fairly complex algorithm (more of a
> > mapping) is used to convert strings to decimals. This should help you
> > understand better (taken from the Javadoc for the split method of the
> > text splitter):
> >
> > /**
> >  * This method needs to determine the splits between two user-provided
> >  * strings. In the case where the user's strings are 'A' and 'Z', this is
> >  * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
> >  * splits for strings beginning with each letter, etc.
> >  *
> >  * If a user has provided us with the strings "Ham" and "Haze", however, we
> >  * need to create splits that differ in the third letter.
> >  *
> >  * The algorithm used is as follows:
> >  * Since there are 2**16 unicode characters, we interpret characters as
> >  * digits in base 65536. Given a string 's' containing characters s_0, s_1
> >  * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2 .. s_n in
> >  * base 65536. Having mapped the low and high strings into floating-point
> >  * values, we then use the BigDecimalSplitter to establish the even split
> >  * points, then map the resulting floating point values back into strings.
> >  */
> > public List<InputSplit> split(Configuration conf, ResultSet results,
> >     String colName) throws SQLException {
> >
> > After the splits are calculated, WHERE clauses are used in the SELECT
> > queries fired by each mapper to retrieve the data (i.e. each result is
> > bounded by split points).
> >
> > From a user perspective, you can use a string for splitting except in
> > the following scenario: a char split-by column is not recommended when
> > the DBMS sorts in a case-insensitive manner. The current algorithm
> > used to calculate splits has some flaws. This is known, and Sqoop
> > displays a warning before executing the job.
> >
> > Let me know if you need more details.
> >
> > Thanks,
> > Abhijeet
> >
> > On Wed, Sep 5, 2012 at 5:13 PM, Jarek Jarcec Cecho <[email protected]> wrote:
> >> Hi Alberto,
> >> I've never used a text column for data splitting, but it seems that
> >> Sqoop supports it (I found its splitter in the code). However, I'm
> >> still not sure it's wise, as string operations tend to be much slower
> >> on databases and you might end up with performance issues.
> >> Unfortunately, Sqoop currently does not support any direct way to
> >> affect split creation.
> >>
> >> I tried to think about your problem and came up with two ideas that
> >> might help in your use case:
> >>
> >> 1) Would it be acceptable in your use case to change the zero-date
> >> policy from zeroDateTimeBehavior=round to
> >> zeroDateTimeBehavior=convertToNull? In case the "split" column
> >> contains NULLs, Sqoop will create X+1 splits, where the +1 covers all
> >> NULL values. It probably won't be the best, but it might help to
> >> distribute your load more properly.
> >>
> >> 2) What about splitting the entire job into two parts: first export
> >> all zero dates, and separately, in the next job, the rest of the
> >> values? By doing so you might be able to get a decent distribution
> >> across the "normal" dates part. Importing all the zero dates might be
> >> challenging if you have a lot of them, as there will be only one
> >> value available (and thus just one split), and therefore you might
> >> need to use the text column for split creation in this case anyway.
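> >>
> >> Concretely (just a sketch, untested, reusing the placeholders from
> >> your command): for (1) only the connect string changes,
> >>
> >> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=convertToNull
> >>
> >> and for (2) you would run the import twice with an extra condition in
> >> the query, along these lines:
> >>
> >> --query "<query> WHERE <date-field> = '0000-00-00'" --num-mappers 1
> >> --query "<query> WHERE <date-field> <> '0000-00-00'" --split-by <date-field>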
> >>
> >> Jarcec
> >>
> >> On Wed, Sep 05, 2012 at 10:16:17AM +0200, Alberto Cordioli wrote:
> >> > Thanks Jarcec,
> >> > you've probably identified the problem immediately. In fact, I
> >> > checked the date field, and I think the problem is that my data
> >> > contain some "limit" values like '0000-00-00' (damn whoever
> >> > inserted these).
> >> > The other data are equally distributed over 2 months (from
> >> > 2012-04-01 to 2012-06-01): as you said, with a parallelism of 3,
> >> > 2 mappers will take basically no data while the other will do the
> >> > "true" job, right?
> >> >
> >> > So now my question becomes: the other field that I could use to
> >> > split the job is a hash (string). How does Sqoop divide this type
> >> > of field? Lexicographic order?
> >> >
> >> > Alberto
> >> >
> >> > On 5 September 2012 09:57, Jarek Jarcec Cecho <[email protected]> wrote:
> >> > > Hi Alberto,
> >> > > taking into account that you have 910 million records and your
> >> > > job was able to get to 75% in a matter of 8 minutes and then
> >> > > slowed down significantly, I have a feeling that your splits were
> >> > > not equally divided. Based on your command line, it seems that
> >> > > you're dividing the data by some date field. Is this date field
> >> > > uniformly distributed? E.g., is there roughly the same number of
> >> > > rows for each date, or do you have more rows on more recent days?
> >> > >
> >> > > Because Sqoop has no idea how exactly the data are distributed in
> >> > > your database, it assumes a uniform distribution. Let me explain
> >> > > why it matters with the following example. Let's consider a table
> >> > > with one row on 2012-01-01, a second row on 2012-02-01, and 1M
> >> > > rows on 2012-03-01, and let's assume that we will use three
> >> > > mappers (--num-mappers 3). In this case, Sqoop will create three
> >> > > splits: 2012-01-01 up to 2012-01-31, 2012-02-01 up to 2012-02-28,
> >> > > and 2012-03-01 up to 2012-03-31. Because the first two mappers
> >> > > have just one row each to move, they will finish almost instantly
> >> > > and get the job to 66% done (2 out of 3 mappers are done), while
> >> > > the last mapper will be running for some time, as it needs to
> >> > > move 1M rows. To an external observer it would appear that Sqoop
> >> > > has stopped, but what really happened is that the data are not
> >> > > uniformly distributed across all mappers.
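> >> > >
> >> > > Schematically (a simplified sketch, not Sqoop's actual splitter
> >> > > code), the split points are computed like this, which is why they
> >> > > are evenly spaced no matter how the rows are distributed:
> >> > >
> >> > > import java.sql.Date;
> >> > >
> >> > > public class DateSplitSketch {
> >> > >     public static void main(String[] args) {
> >> > >         long min = Date.valueOf("2012-01-01").getTime();
> >> > >         long max = Date.valueOf("2012-03-31").getTime();
> >> > >         int numSplits = 3; // --num-mappers 3
> >> > >         long step = (max - min) / numSplits;
> >> > >         for (int i = 0; i < numSplits; i++) {
> >> > >             long lo = min + i * step;
> >> > >             long hi = (i == numSplits - 1) ? max : lo + step;
> >> > >             // each mapper's query is bounded by [lo, hi]
> >> > >             System.out.println(new Date(lo) + " .. " + new Date(hi));
> >> > >         }
> >> > >     }
> >> > > }
> >> > >
> >> > > The boundaries depend only on the min and max values, not on the
> >> > > row counts, so all 1M rows from 2012-03-01 land in the last split.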
> >> > >
> >> > > Jarcec
> >> > >
> >> > > On Wed, Sep 05, 2012 at 09:37:49AM +0200, Alberto Cordioli wrote:
> >> > >> Hi all,
> >> > >>
> >> > >> I am using Sqoop to import a big MySQL table (around 910 million
> >> > >> records) into HBase.
> >> > >> The command line that I'm using is something like:
> >> > >>
> >> > >> sqoop import --connect
> >> > >> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round --username
> >> > >> <usr> -P --query '<query>' --split-by <date-field> --hbase-table
> >> > >> "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"
> >> > >>
> >> > >> The strange thing is that it takes a long time to complete the
> >> > >> last part of the map. This is part of the log:
> >> > >>
> >> > >> [...]
> >> > >> 12/09/04 17:16:45 INFO mapred.JobClient: Running job: job_201209031227_0007
> >> > >> 12/09/04 17:16:46 INFO mapred.JobClient:  map 0% reduce 0%
> >> > >> 12/09/04 17:24:20 INFO mapred.JobClient:  map 25% reduce 0%
> >> > >> 12/09/04 17:24:21 INFO mapred.JobClient:  map 50% reduce 0%
> >> > >> 12/09/04 17:24:23 INFO mapred.JobClient:  map 75% reduce 0%
> >> > >>
> >> > >> As you can see, it does not take much time to go from start to
> >> > >> 75%, but the last part hasn't finished (although it has been
> >> > >> working continuously for a day).
> >> > >> Is there something wrong? I've tried to take a look at the logs,
> >> > >> but they seem to be ok.
> >> > >>
> >> > >> Thanks,
> >> > >> Alberto
> >> > >>
> >> > >> --
> >> > >> Alberto Cordioli
> >> >
> >> > --
> >> > Alberto Cordioli
>
> --
> Alberto Cordioli
