Perfect. Thanks for the help!

Cheers,
Alberto
On 5 September 2012 16:50, abhijeet gaikwad <[email protected]> wrote:
> I forgot to mention, but the calculations are done for at most the first 8 chars of
> the string. So computation-wise you are safe - but that may generate inaccurate
> splits for some scenarios.
> Anyway, I feel Jarcec's solutions are a better workaround :)
>
> Thanks,
> Abhijeet
>
> On 5 Sep 2012 20:00, "Alberto Cordioli" <[email protected]> wrote:
>>
>> Mmh, I see. Hence I should avoid splitting by string fields, since my hash
>> field is 72 chars long and it requires a lot of computation (if I understood
>> correctly).
>> I think one of the solutions proposed by Jarcec could be ok.
>> I also think I'll divide my big table into smaller chunks, since the
>> problem is the query that determines the split points. What do you think?
>>
>> Cheers,
>> Alberto
>>
>>
>> On 5 September 2012 15:21, abhijeet gaikwad <[email protected]> wrote:
>> > Hi Alberto,
>> > Sqoop calculates split points by converting the min. and max. string values
>> > returned by the DBMS for the column into decimal values, and then uses the
>> > decimal splitter. There is a complex algorithm (more of a mapping) used for
>> > converting string to decimal. This should help you understand better (taken
>> > from the Java docs for the split method of the text splitter):
>> >
>> > /**
>> >  * This method needs to determine the splits between two user-provided
>> >  * strings. In the case where the user's strings are 'A' and 'Z', this is
>> >  * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
>> >  * splits for strings beginning with each letter, etc.
>> >  *
>> >  * If a user has provided us with the strings "Ham" and "Haze", however, we
>> >  * need to create splits that differ in the third letter.
>> >  *
>> >  * The algorithm used is as follows:
>> >  * Since there are 2**16 unicode characters, we interpret characters as
>> >  * digits in base 65536. Given a string 's' containing characters s_0, s_1
>> >  * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2 .. s_n in
>> >  * base 65536. Having mapped the low and high strings into floating-point
>> >  * values, we then use the BigDecimalSplitter to establish the even split
>> >  * points, then map the resulting floating point values back into strings.
>> >  */
>> > public List<InputSplit> split(Configuration conf, ResultSet results,
>> >     String colName) throws SQLException {
>> >
>> > After the splits are calculated, WHERE clauses are used in the SELECT
>> > queries fired by each mapper (i.e. each result set is bounded by its split
>> > points) to retrieve the data.
>> >
>> > From a user perspective, you can use a string column for splitting except in
>> > the following scenario: a char split-by column is not recommended when the
>> > DBMS sorts in a case-insensitive manner. The current algorithm used to
>> > calculate splits has some flaws. This is known and Sqoop displays a warning
>> > before executing the job.
>> >
>> > Let me know if you need more details.
>> >
>> > Thanks,
>> > Abhijeet
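A minimal sketch of the mapping described in that javadoc (illustrative Java only, written against the description above rather than copied from Sqoop's actual TextSplitter source; as noted earlier in the thread, the real code also limits the conversion to roughly the first 8 characters):

import java.math.BigDecimal;

public class TextToDecimalSketch {
    // Each further character is worth 1/65536 of the previous place.
    private static final BigDecimal ONE_PLACE = new BigDecimal(65536);

    // Interpret "s_0 s_1 .. s_n" as the fraction 0.s_0 s_1 .. s_n in base 65536.
    static BigDecimal stringToBigDecimal(String str) {
        BigDecimal result = BigDecimal.ZERO;
        BigDecimal curPlace = BigDecimal.ONE.divide(ONE_PLACE); // 1/65536, exact
        for (int i = 0; i < str.length(); i++) {
            result = result.add(new BigDecimal((int) str.charAt(i)).multiply(curPlace));
            curPlace = curPlace.divide(ONE_PLACE); // exact, since 65536 = 2^16
        }
        return result;
    }

    // Map a value in [0, 1) back to a string of at most maxChars characters.
    static String bigDecimalToString(BigDecimal value, int maxChars) {
        StringBuilder sb = new StringBuilder();
        BigDecimal cur = value;
        for (int i = 0; i < maxChars; i++) {
            cur = cur.multiply(ONE_PLACE);
            int codePoint = cur.intValue();
            if (codePoint <= 0) {
                break;
            }
            sb.append((char) codePoint);
            cur = cur.subtract(new BigDecimal(codePoint));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BigDecimal low  = stringToBigDecimal("Ham");
        BigDecimal high = stringToBigDecimal("Haze");
        // A split point halfway between the two strings, mapped back to text.
        BigDecimal mid = low.add(high).divide(new BigDecimal(2));
        System.out.println(bigDecimalToString(mid, 8)); // starts with "Ha", falls between "Ham" and "Haze"
    }
}

Because only the leading characters meaningfully influence the value, long keys such as a 72-character hash are effectively split on their first few characters only.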
>> > On Wed, Sep 5, 2012 at 5:13 PM, Jarek Jarcec Cecho <[email protected]> wrote:
>> >>
>> >> Hi Alberto,
>> >> I've never used a text column for data splitting, however it seems that
>> >> Sqoop supports it (I found its splitter in the code). However, I'm still
>> >> not sure it's wise, as string operations tend to be much slower on
>> >> databases and you might end up with performance issues. Unfortunately,
>> >> Sqoop currently does not offer any direct way to affect split creation.
>> >>
>> >> I tried to think about your problem and came up with two ideas that might
>> >> help in your use case:
>> >>
>> >> 1) Would it be acceptable in your use case to change the zero date policy
>> >> from zeroDateTimeBehavior=round to zeroDateTimeBehavior=convertToNull? In
>> >> case the "split" column contains nulls, Sqoop will create X+1 splits, where
>> >> the +1 will cover all NULL values. It probably won't be the best, but it
>> >> might help to distribute your load more properly.
>> >>
>> >> 2) What about splitting the entire job into two parts - first export all
>> >> the zero dates, and then in a separate job the rest of the values? By doing
>> >> so you might be able to get a decent distribution across the "normal" dates
>> >> part. Importing all the zero dates might be challenging if you have a lot
>> >> of them, as there will be only one value available (and thus just one
>> >> split), and therefore you might need to use the text column for split
>> >> creation in this case anyway.
>> >>
>> >> Jarcec
>> >>
>> >> On Wed, Sep 05, 2012 at 10:16:17AM +0200, Alberto Cordioli wrote:
>> >> > Thanks Jarcec,
>> >> > you've probably identified the problem immediately. In fact, I checked
>> >> > the date field, and I think the problem is that in my data I have some
>> >> > "limit" values like '0000-00-00' (damn whoever inserted these).
>> >> > The other data are equally distributed over 2 months (from 2012-04-01 to
>> >> > 2012-06-01): as you said, with a parallelism of 3, 2 mappers will take
>> >> > basically no data while the other will do the "true" job, right?
>> >> >
>> >> > So, now my question becomes: the other field that I could use to split
>> >> > the job is a hash (string). How does Sqoop divide this type of field?
>> >> > Lexicographic order?
>> >> >
>> >> >
>> >> > Alberto
>> >> >
>> >> >
>> >> >
>> >> > On 5 September 2012 09:57, Jarek Jarcec Cecho <[email protected]> wrote:
>> >> > > Hi Alberto,
>> >> > > taking into account that you have 910 million records and your job was
>> >> > > able to get to 75% in a matter of 8 minutes and then slowed down
>> >> > > significantly, I do have a feeling that your splits were not equally
>> >> > > divided. Based on your command line it seems that you're dividing data
>> >> > > by some date field. Is this date field uniformly distributed? E.g. is
>> >> > > there roughly the same number of rows for each date, or do you have
>> >> > > more rows in more recent days?
>> >> > >
>> >> > > Because Sqoop has no idea how exactly the data are distributed in your
>> >> > > database, it assumes a uniform distribution. Let me explain why it
>> >> > > matters with the following example. Let's consider a table where there
>> >> > > is one row on 2012-01-01, a second row on 2012-02-01 and 1M rows on
>> >> > > 2012-03-01. Let's assume that we will use three mappers
>> >> > > (--num-mappers 3). In this case, Sqoop will create three splits:
>> >> > > 2012-01-01 up to 2012-01-31, 2012-02-01 up to 2012-02-28 and 2012-03-01
>> >> > > up to 2012-03-31. Because the first two mappers have just one row to
>> >> > > move, they will finish almost instantly and get the job to 66% done
>> >> > > (2 out of 3 mappers are done); however, the last mapper will be running
>> >> > > for some time as it needs to move 1M rows. To an external observer it
>> >> > > would appear that Sqoop has stopped, but what really happened is just
>> >> > > that the data are not uniformly distributed across all mappers.
>> >> > >
>> >> > > Jarcec
>> >> > >
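To put rough numbers on the skew Jarcec describes, here is a small illustrative sketch (plain Java, not Sqoop's actual date splitter code): a numeric splitter divides the [min, max] interval of the split column evenly, so a single '0000-00-00' outlier (rounded to 0001-01-01 by zeroDateTimeBehavior=round) drags the minimum back by two millennia and leaves all the real 2012 rows in the last split.

import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

public class SplitSkewSketch {
    // Illustrative only: divide [min, max] into numSplits even ranges, the way a
    // numeric splitter does. Dates are treated as epoch milliseconds.
    static List<long[]> evenSplits(long min, long max, int numSplits) {
        List<long[]> splits = new ArrayList<>();
        long step = (max - min) / numSplits;
        for (int i = 0; i < numSplits; i++) {
            long lo = min + i * step;
            long hi = (i == numSplits - 1) ? max : lo + step;
            splits.add(new long[] {lo, hi});
        }
        return splits;
    }

    public static void main(String[] args) {
        // zeroDateTimeBehavior=round turns '0000-00-00' into '0001-01-01 00:00:00',
        // so the column minimum becomes year 1 instead of 2012-04-01.
        long min = Timestamp.valueOf("0001-01-01 00:00:00").getTime();
        long max = Timestamp.valueOf("2012-06-01 00:00:00").getTime();
        for (long[] s : evenSplits(min, max, 3)) {
            System.out.println(new Timestamp(s[0]) + "  ..  " + new Timestamp(s[1]));
        }
        // Roughly: [year 1 .. ~671), [~671 .. ~1341), [~1341 .. 2012-06-01].
        // Only the last range contains any real rows, so only one mapper does real work.
    }
}

With the zero dates present, all of the real 2012 rows land in the last split, which matches a job that races to 75% and then sits there while one mapper moves nearly all 910 million rows.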
>> >> > > On Wed, Sep 05, 2012 at 09:37:49AM +0200, Alberto Cordioli wrote:
>> >> > >> Hi all,
>> >> > >>
>> >> > >> I am using Sqoop to import a big MySQL table (around 910 million
>> >> > >> records) into HBase.
>> >> > >> The command line that I'm using is something like:
>> >> > >> sqoop import --connect
>> >> > >> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round --username <usr>
>> >> > >> -P --query '<query>' --split-by <date-field> --hbase-table
>> >> > >> "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"
>> >> > >>
>> >> > >> The strange thing is that it takes a long time to complete the last
>> >> > >> part of the map. This is part of the log:
>> >> > >>
>> >> > >> [...]
>> >> > >> 12/09/04 17:16:45 INFO mapred.JobClient: Running job: job_201209031227_0007
>> >> > >> 12/09/04 17:16:46 INFO mapred.JobClient:  map 0% reduce 0%
>> >> > >> 12/09/04 17:24:20 INFO mapred.JobClient:  map 25% reduce 0%
>> >> > >> 12/09/04 17:24:21 INFO mapred.JobClient:  map 50% reduce 0%
>> >> > >> 12/09/04 17:24:23 INFO mapred.JobClient:  map 75% reduce 0%
>> >> > >>
>> >> > >> As you can see, it does not take much time to go from start to 75%,
>> >> > >> but the last part hasn't finished (although it has been working for a
>> >> > >> day continuously).
>> >> > >> Is there something wrong? I've tried to take a look at the logs, but
>> >> > >> it seems to be ok.
>> >> > >>
>> >> > >>
>> >> > >> Thanks,
>> >> > >> Alberto
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> --
>> >> > >> Alberto Cordioli
>> >> >
>> >> >
>> >> > --
>> >> > Alberto Cordioli
>> >
>> >
>>
>>
>> --
>> Alberto Cordioli

--
Alberto Cordioli
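For reference, Jarcec's second suggestion above might look roughly like the two invocations below. This is only a sketch built from the command already quoted in this thread: <query> stands for the original SELECT (adjust if it already has a WHERE clause), the other placeholders are unchanged, and Sqoop requires the literal $CONDITIONS token in any free-form --query.

# Job 1: everything except the zero dates; splits cover the real 2012-04..2012-06 range.
sqoop import \
  --connect "jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round" \
  --username <usr> -P \
  --query '<query> WHERE <date-field> <> "0000-00-00" AND $CONDITIONS' \
  --split-by <date-field> \
  --hbase-table "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"

# Job 2: only the zero dates. There is a single distinct value, so use a single
# mapper (or split this job by another column, e.g. the hash, accepting the
# string-splitting caveats discussed above).
sqoop import \
  --connect "jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round" \
  --username <usr> -P \
  --query '<query> WHERE <date-field> = "0000-00-00" AND $CONDITIONS' \
  --num-mappers 1 \
  --hbase-table "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"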
