Mmh, I see. So I should avoid splitting by string fields, since my hash field is 72 characters long and that would require a lot of computation (if I understood correctly). I think one of the solutions proposed by Jarcec could work. I also think I'll divide my big table into smaller chunks, since the problem is the query that determines the split points. What do you think?
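Just to make the chunking idea concrete, I was thinking of running one import per date range, roughly like this (only a sketch: <cols>, <table> and the date bounds are placeholders for my real query):

sqoop import --connect jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round \
    --username <usr> -P \
    --query 'SELECT <cols> FROM <table> WHERE <date-field> >= "2012-04-01" AND <date-field> < "2012-05-01" AND $CONDITIONS' \
    --split-by <date-field> \
    --hbase-table "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"

and then the same command again for the following ranges, so that each job computes its split points (and its min/max bounding query) over a much smaller slice of the table.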
Cheers,
Alberto

On 5 September 2012 15:21, abhijeet gaikwad <[email protected]> wrote:
> Hi Alberto,
> Sqoop calculates split points by converting the min. and max. string values
> returned by the DBMS for the column into their decimal values using an
> algorithm, and then uses the decimal splitter. There is a complex algorithm
> (more of a mapping) used for converting string to decimal. This should help
> you understand it better (taken from the Java docs for the split method of
> the text splitter):
>
> /**
>  * This method needs to determine the splits between two user-provided
>  * strings. In the case where the user's strings are 'A' and 'Z', this is
>  * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
>  * splits for strings beginning with each letter, etc.
>  *
>  * If a user has provided us with the strings "Ham" and "Haze", however, we
>  * need to create splits that differ in the third letter.
>  *
>  * The algorithm used is as follows:
>  * Since there are 2**16 unicode characters, we interpret characters as
>  * digits in base 65536. Given a string 's' containing characters s_0, s_1
>  * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
>  * base 65536. Having mapped the low and high strings into floating-point
>  * values, we then use the BigDecimalSplitter to establish the even split
>  * points, then map the resulting floating point values back into strings.
>  */
> public List<InputSplit> split(Configuration conf, ResultSet results,
>     String colName) throws SQLException {
>
> After the splits are calculated, WHERE clauses are used in the SELECT
> queries (i.e. the result is bounded by the split points) fired by each
> mapper to retrieve the data.
>
> From a user's perspective, you can use a string column for splitting except
> in the following scenario: a char split-by column is not recommended when
> the DBMS sorts in a case-insensitive manner. The current algorithm used to
> calculate splits has some flaws; this is known, and Sqoop displays a warning
> before executing the job.
>
> Let me know if you need more details.
>
> Thanks,
> Abhijeet
>
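Just to check that I understood the base-65536 mapping you describe: each character is one "digit" after the decimal point, so a string is turned into a fraction roughly like this? (My own sketch to convince myself, not Sqoop's actual code; I believe the real TextSplitter also limits how many characters it considers.)

import java.math.BigDecimal;
import java.math.MathContext;

public class TextToDecimal {

    // One character position is worth 1/65536, the next 1/65536^2, and so on.
    private static final BigDecimal BASE = new BigDecimal(65536);

    // Interpret the string s_0 s_1 .. s_n as 0.s_0 s_1 .. s_n in base 65536.
    static BigDecimal stringToDecimal(String s) {
        BigDecimal result = BigDecimal.ZERO;
        BigDecimal placeValue = BASE;
        for (int i = 0; i < s.length(); i++) {
            BigDecimal digit = new BigDecimal((int) s.charAt(i));
            result = result.add(digit.divide(placeValue, new MathContext(40)));
            placeValue = placeValue.multiply(BASE);
        }
        return result;
    }

    public static void main(String[] args) {
        // "Ham" and "Haze" share the prefix "Ha", so their decimal images are
        // very close and the even split points computed between them only
        // start to differ from the third character on, as the javadoc says.
        System.out.println(stringToDecimal("Ham"));
        System.out.println(stringToDecimal("Haze"));
    }
}

If that is right, then only the first few characters of my 72-character hash would actually influence the split points.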
>
>
> On Wed, Sep 5, 2012 at 5:13 PM, Jarek Jarcec Cecho <[email protected]> wrote:
>>
>> Hi Alberto,
>> I've never used a text column for data splitting, but it seems that Sqoop
>> supports it (I found its splitter in the code). I'm still not sure it's
>> wise, though, as string operations tend to be much slower on databases and
>> you might end up with performance issues. Unfortunately, Sqoop currently
>> does not offer any direct way to affect split creation.
>>
>> I tried to think about your problem and came up with two ideas that might
>> help in your use case:
>>
>> 1) Would it be acceptable in your use case to change the zero-date policy
>> from zeroDateTimeBehavior=round to zeroDateTimeBehavior=convertToNull? If
>> the "split" column contains NULLs, Sqoop will create X+1 splits, where the
>> extra one covers all NULL values. It probably won't be perfect, but it
>> might help to distribute your load more evenly.
>>
>> 2) What about splitting the entire job into two parts: first import all
>> the zero dates, and then, in a separate job, the rest of the values? That
>> way you should get a decent distribution across the "normal" dates.
>> Importing all the zero dates might be challenging if you have a lot of
>> them, as there will be only one value available (and thus just one split),
>> so you might need to use the text column for split creation in that case
>> anyway.
>>
>> Jarcec
>>
>> On Wed, Sep 05, 2012 at 10:16:17AM +0200, Alberto Cordioli wrote:
>> > Thanks Jarcec,
>> > you've probably identified the problem immediately. In fact, I checked
>> > the date field, and I think the problem is that my data contain some
>> > "limit" values like '0000-00-00' (damn whoever inserted those).
>> > The other data are evenly distributed over 2 months (from 2012-04-01 to
>> > 2012-06-01): as you said, with a parallelism of 3, two mappers will get
>> > basically no data while the third does the "true" job, right?
>> >
>> > So now my question becomes: the other field that I could use to split
>> > the job is a hash (string). How does Sqoop divide this type of field?
>> > Lexicographic order?
>> >
>> >
>> > Alberto
>> >
>> >
>> >
>> > On 5 September 2012 09:57, Jarek Jarcec Cecho <[email protected]> wrote:
>> > > Hi Alberto,
>> > > taking into account that you have 910 million records and your job got
>> > > to 75% in a matter of 8 minutes and then slowed down significantly, I
>> > > have a feeling that your splits were not equally divided. Based on
>> > > your command line it seems that you're dividing the data by some date
>> > > field. Is this date field uniformly distributed? E.g. is there roughly
>> > > the same number of rows for each date, or do you have more rows on
>> > > more recent days?
>> > >
>> > > Because Sqoop has no idea how exactly the data are distributed in your
>> > > database, it assumes a uniform distribution. Let me explain why that
>> > > matters with the following example. Consider a table with one row on
>> > > 2012-01-01, a second row on 2012-02-01, and 1M rows on 2012-03-01, and
>> > > assume we use three mappers (--num-mappers 3). In this case Sqoop will
>> > > create three splits: 2012-01-01 up to 2012-01-31, 2012-02-01 up to
>> > > 2012-02-28, and 2012-03-01 up to 2012-03-31. Because the first two
>> > > mappers each have just one row to move, they finish almost instantly
>> > > and bring the job to 66% done (2 out of 3 mappers finished), whereas
>> > > the last mapper will run for a long time, since it needs to move 1M
>> > > rows. To an external observer it looks as if Sqoop has stopped, but
>> > > what really happened is that the data were not uniformly distributed
>> > > across the mappers.
>> > >
>> > > Jarcec
>> > >
>> > > On Wed, Sep 05, 2012 at 09:37:49AM +0200, Alberto Cordioli wrote:
>> > >> Hi all,
>> > >>
>> > >> I am using Sqoop to import a big MySQL table (around 910 million
>> > >> records) into HBase.
>> > >> The command line that I'm using is something like:
>> > >> sqoop import --connect
>> > >> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round --username <usr>
>> > >> -P --query '<query>' --split-by <date-field> --hbase-table
>> > >> "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"
>> > >>
>> > >> The strange thing is that it takes a long time to complete the last
>> > >> part of the map. This is part of the log:
>> > >>
>> > >> [...]
>> > >> 12/09/04 17:16:45 INFO mapred.JobClient: Running job: job_201209031227_0007
>> > >> 12/09/04 17:16:46 INFO mapred.JobClient:  map 0% reduce 0%
>> > >> 12/09/04 17:24:20 INFO mapred.JobClient:  map 25% reduce 0%
>> > >> 12/09/04 17:24:21 INFO mapred.JobClient:  map 50% reduce 0%
>> > >> 12/09/04 17:24:23 INFO mapred.JobClient:  map 75% reduce 0%
>> > >>
>> > >> As you can see, it does not take much time to get from start to 75%,
>> > >> but the last part has not finished (although it has been running
>> > >> continuously for a day).
>> > >> Is there something wrong? I've tried to take a look at the logs, but
>> > >> everything seems to be ok.
>> > >>
>> > >>
>> > >> Thanks,
>> > >> Alberto
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> Alberto Cordioli
>> >
>> >
>> > --
>> > Alberto Cordioli
>
>

--
Alberto Cordioli
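P.S. To be sure I understood the uniform-distribution assumption Jarcec describes above: the split points are just evenly spaced between the min and max of the --split-by column, something like the sketch below (my own illustration, not Sqoop's actual DateSplitter code). With my '0000-00-00' values the min is pushed far away from the real data, so almost every row falls in the last range and one mapper does nearly all the work.

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class EvenDateSplits {

    // Evenly spaced split points between min and max, one range per mapper.
    static List<LocalDate> splitPoints(LocalDate min, LocalDate max, int numMappers) {
        long lo = min.toEpochDay();
        long hi = max.toEpochDay();
        List<LocalDate> points = new ArrayList<>();
        for (int i = 0; i <= numMappers; i++) {
            points.add(LocalDate.ofEpochDay(lo + (hi - lo) * i / numMappers));
        }
        return points;
    }

    public static void main(String[] args) {
        // Three mappers over [2012-01-01, 2012-03-31] give bounds roughly one
        // month apart; if almost every row lives in the last range (as in
        // Jarcec's example), two mappers finish instantly and the job sits
        // near 66% while the last one moves nearly all the data.
        System.out.println(splitPoints(LocalDate.of(2012, 1, 1), LocalDate.of(2012, 3, 31), 3));
    }
}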
