Re: Spark HBase Bulk load using HFileFormat

2016-07-13 Thread Ted Yu
Can you show the code inside saveAsHFile?

Maybe the partitions of the RDD need to be sorted (for the 1st issue).
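
For reference, the "Added a key not lexically larger than previous" error is raised when the HFile writer is handed a cell that sorts before the one written just before it. The following is a minimal, self-contained sketch of the ordering involved. It is not the poster's saveAsHFile and not HBase's own comparator: CellKey, LexOrder and inWriteOrder are hypothetical names, and the comparison is simplified to row, then family, then qualifier as unsigned bytes (the real comparator also takes timestamp and type into account).

```scala
// Illustrative stand-in for a cell coordinate; this is NOT an HBase class.
final case class CellKey(row: Array[Byte], family: Array[Byte], qualifier: Array[Byte])

object LexOrder {
  // Compare two byte arrays as unsigned bytes, left to right.
  // If one array is a prefix of the other, the shorter one sorts first.
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val d = (a(i) & 0xff) - (b(i) & 0xff)
      if (d != 0) return d
      i += 1
    }
    a.length - b.length
  }

  // Simplified ordering (an assumption of this sketch): row, family, qualifier.
  implicit val cellKeyOrdering: Ordering[CellKey] = new Ordering[CellKey] {
    def compare(x: CellKey, y: CellKey): Int = {
      val r = compareBytes(x.row, y.row)
      if (r != 0) r
      else {
        val f = compareBytes(x.family, y.family)
        if (f != 0) f else compareBytes(x.qualifier, y.qualifier)
      }
    }
  }

  // True when no key sorts before the key that precedes it.
  def inWriteOrder(keys: Seq[CellKey]): Boolean =
    keys.zip(keys.drop(1)).forall { case (p, q) => cellKeyOrdering.compare(p, q) <= 0 }
}
```

Sorting each row's cells inside preparePUT orders the cells of one row only; it does not order rows relative to each other within a partition, so a check like inWriteOrder over a whole partition may be a quick way to confirm whether ordering is the cause.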

Cheers



Spark HBase Bulk load using HFileFormat

2016-07-13 Thread yeshwanth kumar
Hi, I am doing a bulk load into HBase in HFile format, using
saveAsNewAPIHadoopFile.

I am on HBase 1.2.0-cdh5.7.0 and Spark 1.6.

When I try to write, I get an exception:

 java.io.IOException: Added a key not lexically larger than previous.

The following is the code snippet:

case class HBaseRow(rowKey: ImmutableBytesWritable, kv: KeyValue)

val kAvroDF = sqlContext.read.format("com.databricks.spark.avro").load(args(0))
val kRDD = kAvroDF.select("seqid", "mi", "moc", "FID", "WID").rdd
val trRDD = kRDD.map(a => preparePUT(a(1).asInstanceOf[String], a(3).asInstanceOf[String]))
val kvRDD = trRDD.flatMap(a => a).map(a => (a.rowKey, a.kv))
saveAsHFile(kvRDD, args(1))


preparePUT returns a list of HBaseRow(ImmutableBytesWritable, KeyValue)
sorted on KeyValue. I flatMap the RDD to get an
RDD[(ImmutableBytesWritable, KeyValue)] and pass it to saveAsHFile.

I also tried using the Put API;
it throws:

java.lang.Exception: java.lang.ClassCastException:
org.apache.hadoop.hbase.client.Put cannot be cast to
org.apache.hadoop.hbase.Cell


Is there any way I can skip using the KeyValue API
and still do the bulk load into HBase?
Please help me resolve this issue.

Thanks,
-Yeshwanth


Re: is possible to create multiple TableSplit per region?

2016-07-13 Thread Billy Watson
I agree. I'm not an expert though. I do more Pig jobs than anything.

Does anyone else on the thread have more experience creating MR jobs on HBase
data?


RE: Re:is possible to create multiple TableSplit per region?

2016-07-13 Thread Frank Luo
It will work, but it is a pretty awkward way to create more mappers.


Re: Re:is possible to create multiple TableSplit per region?

2016-07-13 Thread Billy Watson
It seems like it might be faster then to consider a map job followed by
another map job. Or, depending on the web service calls, maybe a combine
step?

William Watson
Lead Software Engineer



RE: Re:is possible to create multiple TableSplit per region?

2016-07-13 Thread Frank Luo
It makes a number of web-service calls.



Re: Re:is possible to create multiple TableSplit per region?

2016-07-13 Thread Billy Watson
What do you mean by "heavy work downstream"?

I think the mailing list might need a *few* more details to help out better.

William Watson



RE: Re:is possible to create multiple TableSplit per region?

2016-07-13 Thread Frank Luo
Thanks for the prompt reply, Lu.

It is true that having a smaller region file size can solve the problem. But it
also has side effects. For example, the total number of regions can easily be
doubled or tripled, and I am already facing the challenge of having too many
regions per server. So I cannot go that route.



Re:is possible to create multiple TableSplit per region?

2016-07-13 Thread 陆巍
here is an archived mail: 
http://mail-archives.apache.org/mod_mbox/hbase-user/201303.mbox/%3cblu0-smtp19115a8967869d6cf0d49ef8f...@phx.gbl%3E







is possible to create multiple TableSplit per region?

2016-07-13 Thread Frank Luo
We have mapper-only jobs operating on the result of a Scan. Because of heavy work
downstream, the mappers run fairly slowly. So I am wondering if there is a way
to create multiple TableSplits on one region, so that multiple mappers can be
created to work on different pieces of data in the region.

I am aware of the MultithreadedTableMapper class, which could be my solution, but I
hesitate to use it as my code is not thread safe.

So any suggestions, or code to share?
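
One way to picture "multiple splits per region" is to cut a region's [start, end) key range into smaller sub-ranges and hand each one to its own mapper. The following is a minimal sketch of that arithmetic only, using no HBase classes. RangeSplit and its methods are hypothetical names, and it assumes both boundaries are non-empty byte strings (the first and last regions of a table have an empty boundary and would need special handling).

```scala
object RangeSplit {
  // Read a key as an unsigned big-endian integer, right-padded with zero bytes to `width`.
  private def toBigInt(key: Array[Byte], width: Int): BigInt =
    BigInt(1, key.padTo(width, 0.toByte))

  // Convert a non-negative value back to exactly `width` bytes (big-endian).
  private def toBytes(v: BigInt, width: Int): Array[Byte] = {
    val trimmed = v.toByteArray.dropWhile(_ == 0.toByte)
    Array.fill[Byte](width - trimmed.length)(0.toByte) ++ trimmed
  }

  // Cut [start, end) into at most n contiguous sub-ranges by linear interpolation.
  // Fewer than n ranges come back when the key space is too narrow to cut n ways.
  def split(start: Array[Byte], end: Array[Byte], n: Int): Seq[(Array[Byte], Array[Byte])] = {
    require(n >= 1 && start.nonEmpty && end.nonEmpty, "n >= 1 and non-empty keys required")
    val width = math.max(start.length, end.length)
    val lo = toBigInt(start, width)
    val hi = toBigInt(end, width)
    require(lo < hi, "start must sort before end")
    val interior = (1 until n)
      .map(i => lo + (hi - lo) * i / n)
      .distinct
      .filter(b => b > lo && b < hi)
      .map(toBytes(_, width))
    // Keep the original boundaries as the outer edges so no key is lost or duplicated.
    val bounds = (start +: interior) :+ end
    bounds.zip(bounds.drop(1))
  }
}
```

Each resulting pair could serve as the start and stop row of its own split, for example from an overridden getSplits in a TableInputFormat subclass. Note that even key ranges do not imply even row counts: if the keys in a region are skewed, some sub-ranges will hold far more rows than others.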



Download the latest installment of our annual Marketing Imperatives, “Winning 
with People-Based 
Marketing”

This email and any attachments transmitted with it are intended for use by the 
intended recipient(s) only. If you have received this email in error, please 
notify the sender immediately and then delete it. If you are not the intended 
recipient, you must not keep, use, disclose, copy or distribute this email 
without the author’s prior permission. We take precautions to minimize the risk 
of transmitting software viruses, but we advise you to perform your own virus 
checks on any attachment to this message. We cannot accept liability for any 
loss or damage caused by software viruses. The information contained in this 
communication may be confidential and may be subject to the attorney-client 
privilege.


unable to write data to hbase without any error

2016-07-13 Thread 罗辉
Hello guys,
I have a Spark SQL app that writes some data to HBase; however, the app hangs
without any exception or error.
Here is my code:
//code base :https://hbase.apache.org/book.html#scala
val sparkMasterUrlDev = "spark://master60:7077"
val sparkMasterUrlLocal = "local[2]"

val sparkConf = new SparkConf()
  .setAppName("HbaseConnector2")
  .setMaster(sparkMasterUrlDev)
  .set("spark.executor.memory", "10g")

val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val hivetable = hc.sql("select * from house_id_city_pv_range")
hivetable.persist()

val c = new CalendarTool
val yesterday = c.getDate
val stringA = ""

hivetable.repartition(6).foreachPartition { y =>
  println("")
  val conf = new HBaseConfiguration()
  conf.set("hbase.zookeeper.quorum", "master60,slave61,slave62")
  conf.set("hbase.zookeeper.property.clientPort", "2181")
  conf.set("hbase.rootdir", "hdfs://master60:10001/hbase")
  val table = new HTable(conf, "id_pv")
  val connection = ConnectionFactory.createConnection(conf)
  val admin = connection.getAdmin()
  y.foreach { x =>
    val rowkeyp1 = x.getInt(2).toString()
    val rowkeyp2 = stringA.substring(0, 8 - rowkeyp1.length())
    val rowkeyp3 = x.getInt(1).toString()
    val rowkeyp4 = stringA.substring(0, 8 - rowkeyp3.length())
    val rowkeyp5 = yesterday
    val rowkey = rowkeyp1 + rowkeyp2 + rowkeyp3 + rowkeyp4 + rowkeyp5
    val theput = new Put(Bytes.toBytes(yesterday))
    theput.add(Bytes.toBytes("id"), Bytes.toBytes(x.getInt(0).toString()), Bytes.toBytes(x.getInt(0)))
    theput.add(Bytes.toBytes("pv"), Bytes.toBytes(x.getInt(3).toString()), Bytes.toBytes(x.getInt(3)))
    table.put(theput)
  }
}
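
Two things stand out in the snippet as archived. First, stringA is an empty string, so stringA.substring(0, 8 - rowkeyp1.length()) throws a StringIndexOutOfBoundsException for any part shorter than 8 characters. Second, the computed rowkey is never used: the Put is built from yesterday. Assuming the intent was to pad each numeric part to a fixed width of 8 with '0' characters (an assumption; the original padding string is not visible), a minimal sketch could look like this. RowKey, leftPad and build are hypothetical names. The sketch pads on the left, whereas the posted code appends the padding after the value; right padding would map 12 and 120 to the same 8-character string.

```scala
object RowKey {
  // Fixed width of each numeric part; 8 is taken from the posted code.
  val Width = 8

  // Left-pad `part` with '0' up to `width` characters (the padding character is an assumption).
  def leftPad(part: String, width: Int = Width): String = {
    require(part.length <= width, s"'$part' is longer than $width characters")
    ("0" * (width - part.length)) + part
  }

  // Concatenate two padded numeric parts and a date string into one row key.
  def build(first: Int, second: Int, day: String): String =
    leftPad(first.toString) + leftPad(second.toString) + day
}
```

The resulting string would then be passed to the Put constructor in place of yesterday, e.g. Bytes.toBytes(RowKey.build(x.getInt(2), x.getInt(1), yesterday)), if one row per (id, id, day) combination is what is wanted. Negative integers are not handled by this sketch.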
Last lines of my Spark app log:
16/07/13 17:18:33 INFO DAGScheduler: looking for newly runnable stages
16/07/13 17:18:33 INFO DAGScheduler: running: Set()
16/07/13 17:18:33 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/07/13 17:18:33 INFO DAGScheduler: failed: Set()
16/07/13 17:18:33 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at foreachPartition at HbaseConnector2.scala:33), which has no missing parents
16/07/13 17:18:33 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.0 KB, free 101.0 KB)
16/07/13 17:18:33 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.3 KB, free 103.4 KB)
16/07/13 17:18:33 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.0.10.60:48953 (size: 2.3 KB, free: 4.1 GB)
16/07/13 17:18:33 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/07/13 17:18:33 INFO DAGScheduler: Submitting 6 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at foreachPartition at HbaseConnector2.scala:33)
16/07/13 17:18:33 INFO TaskSchedulerImpl: Adding task set 1.0 with 6 tasks
16/07/13 17:18:33 INFO FairSchedulableBuilder: Added task set TaskSet_1 tasks to pool default
16/07/13 17:18:33 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, slave62, partition 0,NODE_LOCAL, 2430 bytes)
16/07/13 17:18:33 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, slave62, partition 1,NODE_LOCAL, 2430 bytes)
16/07/13 17:18:33 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, slave62, partition 2,NODE_LOCAL, 2430 bytes)
16/07/13 17:18:33 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on slave62:38108 (size: 2.3 KB, free: 7.0 GB)
16/07/13 17:18:33 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to slave62:52360
16/07/13 17:18:33 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 138 bytes
16/07/13 17:18:36 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4, slave61, partition 3,ANY, 2430 bytes)
16/07/13 17:18:36 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5, master60, partition 4,ANY, 2430 bytes)
16/07/13 17:18:36 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 6, slave61, partition 5,ANY, 2430 bytes)
16/07/13 17:18:37 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on master60:56270 (size: 2.3 KB, free: 7.0 GB)
16/07/13 17:18:37 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on slave61:47971 (size: 2.3 KB, free: 7.0 GB)
16/07/13 17:18:38 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to master60:33085
16/07/13 17:18:38 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to slave61:36961

The log in regionserver:
2016-07-13 17:27:44,189 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2016-07-13 17:27:44,191 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2016-07-13 17:27:44,692