OH!

So even though I can still see my job running in the YARN resource manager 
after losing my SSH window, THAT job only creates the HFiles. They won’t 
actually be handed off to HBase because the session is gone.

Is that correct?

That would clear up a lot of confusion…

From: Gabriel Reid [mailto:[email protected]]
Sent: Thursday, June 25, 2015 2:44 PM
To: [email protected]
Subject: Re: Bug in CsvBulkLoad tool?

Hi Zack,

The job counters are available in the YARN resource manager and/or YARN 
historyserver web interface. Your import job will have an entry in that web 
interface, and you can then click through to the full list of counters for the 
job.
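
If the web interface is hard to reach, you can also locate the job from the 
command line; something like this should work (the job id below is just a 
placeholder):

  # list finished YARN applications to find your import job's id
  yarn application -list -appStates FINISHED

  # print the job status, including the full list of counters
  mapred job -status job_1435000000000_0001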

Something else that you mentioned has made me think of another possible cause 
for what's going on here. If I understand correctly, you're connecting via ssh 
to a gateway machine to launch your job, and then sometimes you lose your ssh 
connection while the job is running. If this is indeed the case, the output 
from the job won't be written to HBase.

The CsvBulkLoadTool works by writing data to HFiles first, and then once all 
the MapReduce work is done, these HFiles are handed off to HBase. If your ssh 
connection closes before the full process has completed, the HFiles won't be 
handed off to HBase.
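
(For what it's worth, that hand-off is essentially HBase's incremental bulk 
load step. As a rough sketch, with a made-up output path and table name:

  hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
      /tmp/csv-bulk-load-output MY_TABLE

So if the MapReduce phase finished and the HFiles are still in the tool's 
output directory on HDFS, it may be possible to complete the load by hand.)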

Is it possible that something like this is going on? If so, I recommend you 
start your job up from within screen [1] or tmux if they're available to you so 
that your terminal session doesn't end even if you drop your ssh connection.
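
For example, something along these lines should work (the jar name, zookeeper 
quorum, table name, and input path below are placeholders, so adjust them for 
your installation):

  # start a detachable session so the client survives a dropped ssh connection
  screen -S bulkload

  hadoop jar phoenix-<version>-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      --zookeeper zk1,zk2,zk3 \
      --table MY_TABLE \
      --input /data/csv

  # detach with Ctrl-a d, and re-attach later with:
  screen -r bulkload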

- Gabriel

On Thu, Jun 25, 2015 at 8:27 PM Riesland, Zack 
<[email protected]<mailto:[email protected]>> wrote:
Thanks Gabriel,

Then perhaps I discovered something interesting.

After my last email, I created a new table with the exact same script, except I 
changed the name of the table and the name of the primary key.

Then, I ran the CsvBulkLoad tool using the same folder with the 35 CSV files.

This time, my table has 1,479,451,977 rows, as expected.

Just like before, there were 1042 mappers, 1792 reducers, etc.

It might be interesting to write a simple test like this:

Create table X with primary key name ‘X’.

Ingest some data.

Create table Y with primary key name ‘X’.

Ingest the same data.

See whether the expected results are achieved.
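
Roughly, in Phoenix/shell terms, that test might look like this (all table, 
column, host, and path names here are invented for illustration):

  # table_x.sql:  CREATE TABLE X (X VARCHAR NOT NULL PRIMARY KEY, V VARCHAR);
  # table_y.sql:  CREATE TABLE Y (X VARCHAR NOT NULL PRIMARY KEY, V VARCHAR);

  psql.py zk-host table_x.sql
  hadoop jar phoenix-<version>-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      --zookeeper zk-host --table X --input /data/test-csv

  psql.py zk-host table_y.sql
  hadoop jar phoenix-<version>-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      --zookeeper zk-host --table Y --input /data/test-csv

  # then compare the row counts
  hbase org.apache.hadoop.hbase.mapreduce.RowCounter X
  hbase org.apache.hadoop.hbase.mapreduce.RowCounter Y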

One clarification: if I run the CsvBulkLoad tool at the command line and have 
the same SSH window open when it finishes, it is easy to see all the statistics.

But where can I find this data in the logs? Since these ingests can take 
several hours, I sometimes lose my VPN connection and my SSH window goes stale.

Thanks!

From: Gabriel Reid [mailto:[email protected]]
Sent: Thursday, June 25, 2015 2:18 PM

To: [email protected]<mailto:[email protected]>
Subject: Re: Bug in CsvBulkLoad tool?

Hi Zack,

No, you don't need to worry about the name of the primary key getting in the 
way of the rows being added.

Like Anil pointed out, the best thing to look at first is the job counters. The 
relevant ones for debugging this situation are the total map inputs and total 
map outputs, total reduce inputs and total reduce outputs, as well as reduce 
input groups, and finally the PhoenixJobCounters (INPUT_RECORDS, 
FAILED_RECORDS, and OUTPUT_RECORDS).

The INPUT_RECORDS and OUTPUT_RECORDS should both be around the number of rows 
that you expected (i.e. 1.7 billion), along with map input records. If I 
remember correctly, the reduce input groups should be around the same value as 
well.
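
If the web interface is awkward to get to, individual counters can also be 
pulled from the command line, e.g. (the job id is a placeholder, and the 
Phoenix counter group name may differ by version; 'mapred job -status' lists 
the exact group names):

  mapred job -counter job_1435000000000_0001 \
      org.apache.hadoop.mapreduce.TaskCounter MAP_INPUT_RECORDS
  mapred job -counter job_1435000000000_0001 \
      org.apache.hadoop.mapreduce.TaskCounter REDUCE_INPUT_GROUPS
  mapred job -counter job_1435000000000_0001 \
      org.apache.phoenix.mapreduce.PhoenixJobCounters OUTPUT_RECORDS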

Could you post the values that you've got on those counters?

- Gabriel


On Thu, Jun 25, 2015 at 4:41 PM Riesland, Zack 
<[email protected]<mailto:[email protected]>> wrote:
I started writing a long response, and then noticed something:

When I created my new table, I copied/pasted the script and made some changes, 
but didn’t change the name of the primary key.

Is it possible that any row being inserted into the new table with a key that 
matches a row in the OTHER table is being thrown away?

From: anil gupta [mailto:[email protected]]
Sent: Thursday, June 25, 2015 10:20 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: Bug in CsvBulkLoad tool?

Hi Zack,
Can you share the counters of the csvbulkload job? Also, did you run one 
csvbulkload job or 35 bulkload jobs? What's the schema of the Phoenix table? 
How are you making sure that you have no duplicate rowkeys in your dataset?
If you have duplicate rowkeys, then the cells in those rows in HBase will have 
more than one version. That is something I would check on the HBase side to 
investigate this problem.
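
One quick way to check that from the HBase side (the table, rowkey, and 
column names below are only placeholders) is to ask for several versions of a 
cell and see whether more than one comes back:

  echo "get 'MY_TABLE', 'some-rowkey', {COLUMN => '0:SOME_COL', VERSIONS => 5}" \
      | hbase shell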

Thanks,
Anil Gupta

On Thu, Jun 25, 2015 at 3:11 AM, Riesland, Zack 
<[email protected]<mailto:[email protected]>> wrote:
Earlier this week I was surprised to find that, after dumping tons of data from 
a Hive table to an HBase table, about half of the data didn’t end up in HBase.

So, yesterday, I created a new Phoenix table.

This time, I’m splitting on the first 6 characters of the key, which gives me 
about 1700 regions (across 6 fairly beefy region servers).

My 7 billion Hive rows live in 125 CSV files (about 5 GB each) on HDFS.

I copied 35 of them to a separate folder, and ran the CsvBulkLoad tool against 
that folder.

The application manager tells me that the job ran to completion: 1042/1042 
successful maps and 1792/1792 successful reduces.

However, when I run the mapreduce.RowCounter against the new table, it only 
shows about 300 million rows.
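
The count comes from the standard HBase row counter, i.e. an invocation along 
these lines (the table name is a placeholder):

  hbase org.apache.hadoop.hbase.mapreduce.RowCounter MY_NEW_TABLE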

I should see 35/125 * 7 billion = ~ 1.7 billion rows.

These are not primary key collisions.

Can someone please help me understand what is going on?





--
Thanks & Regards,
Anil Gupta
