Re: Spark submit hbase issue

2021-04-14 Thread Mich Talebzadeh
Try adding the hbase-site.xml file to %SPARK_HOME%\conf and see if it works.
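If the file is still not picked up, you can also load it into the HBase
configuration explicitly inside the job. A minimal Scala sketch, assuming a
placeholder local path for hbase-site.xml:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration

// Load hbase-site.xml explicitly so the client does not fall back to the
// default localhost:2181 ZooKeeper quorum. The path below is a placeholder.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.addResource(new Path("C:/path/to/hbase-site.xml"))

// Sanity check that the quorum really came from the file, not the default:
println(hbaseConf.get("hbase.zookeeper.quorum"))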


HTH




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 14 Apr 2021 at 22:40, KhajaAsmath Mohammed 
wrote:

> Hi,
>
> Spark submit is connecting to localhost instead of the ZooKeeper quorum
> mentioned in hbase-site.xml. The same program works in the IDE, which picks
> up hbase-site.xml. What am I missing in spark-submit?
> >
> > 
> > spark-submit --driver-class-path
> C:\Users\mdkha\bitbucket\clx-spark-scripts\src\test\resources\hbase-site.xml
> --files
> C:\Users\mdkha\bitbucket\clx-spark-scripts\src\test\resources\hbase-site.xml
> --conf
> "spark.driver.extraLibraryPath=C:\Users\mdkha\bitbucket\clx-spark-scripts\src\test\resources\hbase-site.xml"
> --executor-memory 4g --class com.drivers.HBASEExportToS3
> C:\Users\mdkha\bitbucket\clx-spark-scripts\target\clx-spark-scripts.jar -c
> C:\Users\mdkha\bitbucket\clx-spark-scripts\src\test\resources\job.properties
>
> >
> >
> > 21/04/14 16:13:36 WARN ClientCnxn: Session 0x0 for server null,
> unexpected error, closing socket connection and attempting reconnect
> > java.net.ConnectException: Connection refused: no further information
> > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> > at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
> > at
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
> > at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> > 21/04/14 16:13:37 WARN ReadOnlyZKClient: 0x2d73767e to localhost:2181
> failed for get of /hbase/hbaseid, code = CONNECTIONLOSS, retries = 1
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark RDD + HBase: adoption trend

2021-01-20 Thread Sean Owen
RDDs are still relevant in a few ways - there is no Dataset in Python for
example, so RDD is still the 'typed' API. They still underpin DataFrames.
And of course it's still there because there's probably still a lot of code
out there that uses it. Occasionally it's still useful to drop into that
API for certain operations.

If that's a connector to read data from HBase - you probably do want to
return DataFrames ideally.
Unless you're relying on very specific APIs from very specific versions, I
wouldn't think a distro's Spark or HBase is much different?

On Wed, Jan 20, 2021 at 7:44 AM Marco Firrincieli 
wrote:

> Hi, my name is Marco and I'm one of the developers behind
> https://github.com/unicredit/hbase-rdd
> a project we are currently reviewing for various reasons.
>
> We were basically wondering if RDD "is still a thing" nowadays (we see
> lots of usage for DataFrames or Datasets) and we're not sure how much of
> the community still works/uses RDDs.
>
> Also, for lack of time, we always mainly worked using Cloudera-flavored
> Hadoop/HBase & Spark versions. We were thinking the community would then
> help us organize the project in a more "generic" way, but that didn't
> happen.
>
> So I figured I would ask here what is the gut feeling of the Spark
> community so to better define the future of our little library.
>
> Thanks
>
> -Marco
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark RDD + HBase: adoption trend

2021-01-20 Thread Jacek Laskowski
Hi Marco,

IMHO the RDD API is only for very sophisticated use cases that very few Spark
devs would be capable of handling. I consider the RDD API a sort of Spark
assembler, and most Spark devs should stick to the Dataset API.

Speaking of HBase, see
https://github.com/GoogleCloudPlatform/java-docs-samples/tree/master/bigtable/spark
where you can find a demo that I worked on last year and made sure that:

"Apache HBase™ Spark Connector implements the DataSource API for Apache
HBase and allows executing relational queries on data stored in Cloud
Bigtable."

That makes hbase-rdd even more obsolete, but not necessarily unusable (I am
too little skilled in the HBase space to comment on this).

I think you should consider merging your hbase-rdd project with the official
Apache HBase™ Spark Connector at
https://github.com/apache/hbase-connectors/tree/master/spark (which seems to
lack active development, IMHO).
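For reference, here is a rough sketch of what reading HBase as a DataFrame
through that connector looks like. The table name and column mapping are made
up, and the option names may differ between connector versions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-connector-sketch").getOrCreate()

// Map HBase columns to DataFrame columns; ":key" denotes the row key.
val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "person")
  .option("hbase.columns.mapping",
    "id STRING :key, name STRING c:name, email STRING c:email")
  .option("hbase.spark.use.hbasecontext", false)
  .load()

df.createOrReplaceTempView("person")
spark.sql("SELECT id, name FROM person WHERE email IS NOT NULL").show()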

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Wed, Jan 20, 2021 at 2:44 PM Marco Firrincieli 
wrote:

> Hi, my name is Marco and I'm one of the developers behind
> https://github.com/unicredit/hbase-rdd
> a project we are currently reviewing for various reasons.
>
> We were basically wondering if RDD "is still a thing" nowadays (we see
> lots of usage for DataFrames or Datasets) and we're not sure how much of
> the community still works/uses RDDs.
>
> Also, for lack of time, we always mainly worked using Cloudera-flavored
> Hadoop/HBase & Spark versions. We were thinking the community would then
> help us organize the project in a more "generic" way, but that didn't
> happen.
>
> So I figured I would ask here what is the gut feeling of the Spark
> community so to better define the future of our little library.
>
> Thanks
>
> -Marco
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark to HBase Fast Bulk Upload

2016-09-19 Thread Kabeer Ahmed
Hi,

Without using Spark there are a couple of options. You can refer to the link: 
http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/.

The gist is that you convert the data into HFiles and use the bulk upload 
option to get the data quickly into HBase.
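From Spark, the same HFile route can be driven with saveAsNewAPIHadoopFile.
Below is a rough sketch against the HBase 1.x client API; the table name,
column family/qualifier and the staging directory are placeholders, not part
of the original question:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD

object BulkLoadSketch {
  // rows: (rowKey -> value). Table "my_table" with family "f" is assumed.
  def bulkLoad(rows: RDD[(String, String)]): Unit = {
    val hbaseConf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(hbaseConf)
    val tableName = TableName.valueOf("my_table")
    val table = connection.getTable(tableName)
    val regionLocator = connection.getRegionLocator(tableName)

    // Lets HFileOutputFormat2 pick up the table's region boundaries and
    // compression settings so the HFiles line up with existing regions.
    val job = Job.getInstance(hbaseConf)
    HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

    val stagingDir = "/tmp/hfile-staging"
    rows
      .sortByKey() // HFiles must be written in row-key order
      .map { case (rowKey, value) =>
        val kv = new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes("f"),
          Bytes.toBytes("c"), Bytes.toBytes(value))
        (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), kv)
      }
      .saveAsNewAPIHadoopFile(stagingDir, classOf[ImmutableBytesWritable],
        classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)

    // Hand the finished HFiles over to the region servers.
    new LoadIncrementalHFiles(hbaseConf)
      .doBulkLoad(new Path(stagingDir), connection.getAdmin, table, regionLocator)

    table.close()
    connection.close()
  }
}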

HTH
Kabeer.

On Mon, 19 Sep, 2016 at 12:59 PM, Punit Naik  wrote:
Hi Guys

I have a huge dataset (~ 1TB) which has about a billion records. I have to 
transfer it to an HBase table. What is the fastest way of doing it?

--
Thank You

Regards

Punit Naik




RE: Spark with HBase Error - Py4JJavaError

2016-07-08 Thread Puneet Tripathi
Hi Ram, Thanks very much it worked.

Puneet

From: ram kumar [mailto:ramkumarro...@gmail.com]
Sent: Thursday, July 07, 2016 6:51 PM
To: Puneet Tripathi
Cc: user@spark.apache.org
Subject: Re: Spark with HBase Error - Py4JJavaError

Hi Puneet,
Have you tried appending
 --jars $SPARK_HOME/lib/spark-examples-*.jar
to the execution command?
Ram

On Thu, Jul 7, 2016 at 5:19 PM, Puneet Tripathi 
<puneet.tripa...@dunnhumby.com<mailto:puneet.tripa...@dunnhumby.com>> wrote:
Guys, Please can anyone help on the issue below?

Puneet

From: Puneet Tripathi 
[mailto:puneet.tripa...@dunnhumby.com<mailto:puneet.tripa...@dunnhumby.com>]
Sent: Thursday, July 07, 2016 12:42 PM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Spark with HBase Error - Py4JJavaError

Hi,

We are running HBase in fully distributed mode. I tried to connect to HBase via 
pyspark and then write to HBase using saveAsNewAPIHadoopDataset, but it failed; 
the error says:

Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
: java.lang.ClassNotFoundException: 
org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
I have been able to create pythonconverters.jar and then did the following:


1.  I think we have to copy this to a location on HDFS; /sparkjars/ seems as 
good a directory to create as any. I think the file has to be world readable.

2.  Set the spark_jar_hdfs_path property in Cloudera Manager e.g. 
hdfs:///sparkjars

It still doesn't seem to work. Can someone please help me with this?

Regards,
Puneet
dunnhumby limited is a limited company registered in England and Wales with 
registered number 02388853 and VAT registered number 927 5871 83. Our 
registered office is at Aurora House, 71-75 Uxbridge Road, London W5 5SL. The 
contents of this message and any attachments to it are confidential and may be 
legally privileged. If you have received this message in error you should 
delete it from your system immediately and advise the sender. dunnhumby may 
monitor and record all emails. The views expressed in this email are those of 
the sender and not those of dunnhumby.


Re: Spark with HBase Error - Py4JJavaError

2016-07-07 Thread ram kumar
Hi Puneet,

Have you tried appending
 --jars $SPARK_HOME/lib/spark-examples-*.jar
to the execution command?

Ram

On Thu, Jul 7, 2016 at 5:19 PM, Puneet Tripathi <
puneet.tripa...@dunnhumby.com> wrote:

> Guys, Please can anyone help on the issue below?
>
>
>
> Puneet
>
>
>
> *From:* Puneet Tripathi [mailto:puneet.tripa...@dunnhumby.com]
> *Sent:* Thursday, July 07, 2016 12:42 PM
> *To:* user@spark.apache.org
> *Subject:* Spark with HBase Error - Py4JJavaError
>
>
>
> Hi,
>
>
>
> We are running HBase in fully distributed mode. I tried to connect to
> HBase via pyspark and then write to HBase using *saveAsNewAPIHadoopDataset*,
> but it failed; the error says:
>
>
>
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
>
> : java.lang.ClassNotFoundException:
> org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> I have been able to create pythonconverters.jar and then did below:
>
>
>
> 1.  I think we have to copy this to a location on HDFS; /sparkjars/
> seems as good a directory to create as any. I think the file has to be world
> readable.
>
> 2.  Set the spark_jar_hdfs_path property in Cloudera Manager e.g.
> hdfs:///sparkjars
>
>
>
> It still doesn't seem to work. Can someone please help me with this?
>
>
>
> Regards,
>
> Puneet
>
> dunnhumby limited is a limited company registered in England and Wales
> with registered number 02388853 and VAT registered number 927 5871 83. Our
> registered office is at Aurora House, 71-75 Uxbridge Road, London W5 5SL.
> The contents of this message and any attachments to it are confidential and
> may be legally privileged. If you have received this message in error you
> should delete it from your system immediately and advise the sender.
> dunnhumby may monitor and record all emails. The views expressed in this
> email are those of the sender and not those of dunnhumby.
>


RE: Spark with HBase Error - Py4JJavaError

2016-07-07 Thread Puneet Tripathi
Guys, Please can anyone help on the issue below?

Puneet

From: Puneet Tripathi [mailto:puneet.tripa...@dunnhumby.com]
Sent: Thursday, July 07, 2016 12:42 PM
To: user@spark.apache.org
Subject: Spark with HBase Error - Py4JJavaError

Hi,

We are running HBase in fully distributed mode. I tried to connect to HBase via 
pyspark and then write to HBase using saveAsNewAPIHadoopDataset, but it failed; 
the error says:

Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
: java.lang.ClassNotFoundException: 
org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
I have been able to create pythonconverters.jar and then did below:


1.  I think we have to copy this to a location on HDFS; /sparkjars/ seems as 
good a directory to create as any. I think the file has to be world readable.

2.  Set the spark_jar_hdfs_path property in Cloudera Manager e.g. 
hdfs:///sparkjars

It still doesn't seem to work. Can someone please help me with this?

Regards,
Puneet
dunnhumby limited is a limited company registered in England and Wales with 
registered number 02388853 and VAT registered number 927 5871 83. Our 
registered office is at Aurora House, 71-75 Uxbridge Road, London W5 5SL. The 
contents of this message and any attachments to it are confidential and may be 
legally privileged. If you have received this message in error you should 
delete it from your system immediately and advise the sender. dunnhumby may 
monitor and record all emails. The views expressed in this email are those of 
the sender and not those of dunnhumby.


Re: Spark and HBase RDD join/get

2016-01-14 Thread Ted Yu
For #1, yes it is possible.

You can find some examples in the hbase-spark module of HBase, where HBase as
a DataSource is provided.
e.g.

https://github.com/apache/hbase/blob/master/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseRDDFunctions.scala
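
For illustration, a minimal sketch of option #1 with one connection per
partition (table name, column family and qualifier are assumed, HBase 1.x
client API):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

def enrichWithHBase(userIds: RDD[String]): RDD[(String, Option[String])] =
  userIds.mapPartitions { ids =>
    // One connection per partition, not per element.
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("users"))
    val enriched = ids.map { id =>
      val result = table.get(new Get(Bytes.toBytes(id)))
      val value = Option(result.getValue(Bytes.toBytes("f"), Bytes.toBytes("c")))
        .map(b => Bytes.toString(b))
      (id, value)
    }.toList // materialize before closing the connection
    table.close()
    connection.close()
    enriched.iterator
  }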

Cheers

On Thu, Jan 14, 2016 at 5:04 AM, Kristoffer Sjögren 
wrote:

> Hi
>
> We have an RDD that needs to be mapped with information from
> HBase, where the exact key is the user id.
>
> What are the different alternatives for doing this?
>
> - Is it possible to do HBase.get() requests from a map function in Spark?
> - Or should we join RDDs with a full HBase table scan?
>
> I ask because full table scans feel inefficient, especially if the
> input RDD is really small compared to the full table. But I
> realize that a full table scan may not be what happens in reality?
>
> Cheers,
> -Kristoffer
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark and HBase RDD join/get

2016-01-14 Thread Kristoffer Sjögren
Thanks Ted!

On Thu, Jan 14, 2016 at 4:49 PM, Ted Yu  wrote:
> For #1, yes it is possible.
>
> You can find some examples in the hbase-spark module of HBase, where HBase as
> a DataSource is provided.
> e.g.
>
> https://github.com/apache/hbase/blob/master/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseRDDFunctions.scala
>
> Cheers
>
> On Thu, Jan 14, 2016 at 5:04 AM, Kristoffer Sjögren 
> wrote:
>>
>> Hi
>>
>> We have an RDD that needs to be mapped with information from
>> HBase, where the exact key is the user id.
>>
>> What are the different alternatives for doing this?
>>
>> - Is it possible to do HBase.get() requests from a map function in Spark?
>> - Or should we join RDDs with a full HBase table scan?
>>
>> I ask because full table scans feel inefficient, especially if the
>> input RDD is really small compared to the full table. But I
>> realize that a full table scan may not be what happens in reality?
>>
>> Cheers,
>> -Kristoffer
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
Try Phoenix from Cloudera parcel distribution

https://blog.cloudera.com/blog/2015/11/new-apache-phoenix-4-5-2-package-from-cloudera-labs/

They may have better Kerberos support.

On Tue, Dec 8, 2015 at 12:01 AM Akhilesh Pathodia <
pathodia.akhil...@gmail.com> wrote:

> Yes, its a kerberized cluster and ticket was generated using kinit command
> before running spark job. That's why Spark on hbase worked but when phoenix
> is used to get the connection to hbase, it does not pass the authentication
> to all nodes. Probably it is not handled in Phoenix version 4.3 or Spark
> 1.3.1 does not provide integration with Phoenix for kerberized cluster.
>
> Can anybody confirm whether Spark 1.3.1 supports Phoenix on secured
> cluster or not?
>
> Thanks,
> Akhilesh
>
> On Tue, Dec 8, 2015 at 2:57 AM, Ruslan Dautkhanov 
> wrote:
>
>> That error is not directly related to spark nor hbase
>>
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>> GSSException: No valid credentials provided (Mechanism level: Failed to
>> find any Kerberos tgt)]
>>
>> Is this a kerberized cluster? You likely don't have a good (non-expired)
>> kerberos ticket for authentication to pass.
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>> On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia <
>> pathodia.akhil...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am running spark job on yarn in cluster mode in secured cluster. I am
>>> trying to run Spark on Hbase using Phoenix, but Spark executors are
>>> unable to get hbase connection using phoenix. I am running the kinit command to
>>> get the ticket before starting the job and also keytab file and principal
>>> are correctly specified in connection URL. But still spark job on each node
>>> throws below error:
>>>
>>> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication
>>> failed. The most likely cause is missing or invalid credentials. Consider
>>> 'kinit'.
>>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>>> GSSException: No valid credentials provided (Mechanism level: Failed to
>>> find any Kerberos tgt)]
>>> at
>>> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>>
>>> I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark
>>> on Hbase(without phoenix) successfully in yarn-client mode as mentioned in
>>> this link:
>>>
>>> https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos
>>>
>>> Also, I found that there is a known issue for yarn-cluster mode for
>>> Spark 1.3.1 version:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-6918
>>>
>>> Has anybody been successful in running Spark on hbase using Phoenix in
>>> yarn cluster or client mode?
>>>
>>> Thanks,
>>> Akhilesh Pathodia
>>>
>>
>>
>


Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Akhilesh Pathodia
Yes, it's a kerberized cluster and the ticket was generated using the kinit
command before running the spark job. That's why Spark on HBase worked, but
when Phoenix is used to get the connection to HBase, it does not pass the
authentication to all nodes. Probably it is not handled in Phoenix version 4.3,
or Spark 1.3.1 does not provide integration with Phoenix for a kerberized cluster.

Can anybody confirm whether Spark 1.3.1 supports Phoenix on secured cluster
or not?

Thanks,
Akhilesh

On Tue, Dec 8, 2015 at 2:57 AM, Ruslan Dautkhanov 
wrote:

> That error is not directly related to spark nor hbase
>
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]
>
> Is this a kerberized cluster? You likely don't have a good (non-expired)
> kerberos ticket for authentication to pass.
>
>
> --
> Ruslan Dautkhanov
>
> On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia <
> pathodia.akhil...@gmail.com> wrote:
>
>> Hi,
>>
>> I am running spark job on yarn in cluster mode in secured cluster. I am
>> trying to run Spark on Hbase using Phoenix, but Spark executors are
>> unable to get hbase connection using phoenix. I am running the kinit command to
>> get the ticket before starting the job and also keytab file and principal
>> are correctly specified in connection URL. But still spark job on each node
>> throws below error:
>>
>> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication
>> failed. The most likely cause is missing or invalid credentials. Consider
>> 'kinit'.
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>> GSSException: No valid credentials provided (Mechanism level: Failed to
>> find any Kerberos tgt)]
>> at
>> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>
>> I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark
>> on Hbase(without phoenix) successfully in yarn-client mode as mentioned in
>> this link:
>>
>> https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos
>>
>> Also, I found that there is a known issue for yarn-cluster mode for Spark
>> 1.3.1 version:
>>
>> https://issues.apache.org/jira/browse/SPARK-6918
>>
>> Has anybody been successful in running Spark on hbase using Phoenix in
>> yarn cluster or client mode?
>>
>> Thanks,
>> Akhilesh Pathodia
>>
>
>


Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
That error is not directly related to spark nor hbase

javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]

Is this a kerberized cluster? You likely don't have a good (non-expired)
kerberos ticket for authentication to pass.


-- 
Ruslan Dautkhanov

On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia <
pathodia.akhil...@gmail.com> wrote:

> Hi,
>
> I am running spark job on yarn in cluster mode in secured cluster. I am
> trying to run Spark on Hbase using Phoenix, but Spark executors are
> unable to get hbase connection using phoenix. I am running the kinit command to
> get the ticket before starting the job and also keytab file and principal
> are correctly specified in connection URL. But still spark job on each node
> throws below error:
>
> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication failed.
> The most likely cause is missing or invalid credentials. Consider 'kinit'.
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>
> I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark
> on Hbase(without phoenix) successfully in yarn-client mode as mentioned in
> this link:
>
> https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos
>
> Also, I found that there is a known issue for yarn-cluster mode for Spark
> 1.3.1 version:
>
> https://issues.apache.org/jira/browse/SPARK-6918
>
> Has anybody been successful in running Spark on hbase using Phoenix in
> yarn cluster or client mode?
>
> Thanks,
> Akhilesh Pathodia
>


Re: spark to hbase

2015-10-27 Thread Deng Ching-Mallete
Hi,

It would be more efficient if you configure the table and flush the commits
by partition instead of per element in the RDD. The latter works fine
because you only have 4 elements, but it won't bode well for large data sets
IMO.
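
A sketch of that per-partition pattern, adapted from the code quoted below (it
assumes model is the RDD produced by .values, and reuses the TrainFeature
fields and connection settings from that code):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

// One table instance and one flush per partition instead of per element.
model.foreachPartition { records =>
  val configuration = HBaseConfiguration.create()
  configuration.set("hbase.zookeeper.property.clientPort", "2181")
  configuration.set("hbase.zookeeper.quorum", "192.168.1.66")
  val table = new HTable(configuration, "ljh_test3")
  records.foreach { res =>
    val put = new Put(Bytes.toBytes(res.toKey()))
    put.add(Bytes.toBytes("f"), Bytes.toBytes("c"),
      Bytes.toBytes(res.totalCount + res.positiveCount))
    table.put(put)
  }
  table.flushCommits() // a single flush per partition
  table.close()
}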

Thanks,
Deng

On Tue, Oct 27, 2015 at 5:22 PM, jinhong lu  wrote:

>
> Hi,
>
> I wrote my result to HDFS, and it worked well:
>
> val model = 
> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>  TrainFeature())(seqOp, combOp).values
>  model.map(a => (a.toKey() + "\t" + a.totalCount + "\t" + 
> a.positiveCount)).saveAsTextFile(modelDataPath);
>
> But when I want to write to HBase, the application hangs, with no log, no
> response, it just stays there, and nothing is written to HBase:
>
> val model = 
> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>  TrainFeature())(seqOp, combOp).values.foreach({ res =>
>   val configuration = HBaseConfiguration.create();
>   configuration.set("hbase.zookeeper.property.clientPort", "2181");
>   configuration.set("hbase.zookeeper.quorum", “192.168.1.66");
>   configuration.set("hbase.master", "192.168.1:6");
>   val hadmin = new HBaseAdmin(configuration);
>   val table = new HTable(configuration, "ljh_test3");
>   var put = new Put(Bytes.toBytes(res.toKey()));
>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
> Bytes.toBytes(res.totalCount + res.positiveCount));
>   table.put(put);
>   table.flushCommits()
> })
>
> And then I tried to write some simple data to HBase, and it worked well too:
>
> sc.parallelize(Array(1,2,3,4)).foreach({ res =>
> val configuration = HBaseConfiguration.create();
> configuration.set("hbase.zookeeper.property.clientPort", "2181");
> configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
> configuration.set("hbase.master", "192.168.1:6");
> val hadmin = new HBaseAdmin(configuration);
> val table = new HTable(configuration, "ljh_test3");
> var put = new Put(Bytes.toBytes(res));
> put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes(res));
> table.put(put);
> table.flushCommits()
> })
>
> what is the problem with the 2nd code? thanks a lot.
>
>


Re: spark to hbase

2015-10-27 Thread Ted Yu
Jinghong:
Hadmin variable is not used. You can omit that line. 

Which hbase release are you using ?

As Deng said, don't flush per row. 

Cheers

> On Oct 27, 2015, at 3:21 AM, Deng Ching-Mallete  wrote:
> 
> Hi,
> 
> It would be more efficient if you configure the table and flush the commits 
> by partition instead of per element in the RDD. The latter works fine because 
> you only have 4 elements, but it won't bode well for large data sets IMO.
> 
> Thanks,
> Deng
> 
>> On Tue, Oct 27, 2015 at 5:22 PM, jinhong lu  wrote:
>> 
>> Hi, 
>> 
>> I write my result to hdfs, it did well:
>> 
>> val model = 
>> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>>  TrainFeature())(seqOp, combOp).values
>>  model.map(a => (a.toKey() + "\t" + a.totalCount + "\t" + 
>> a.positiveCount)).saveAsTextFile(modelDataPath);
>> 
>> But when I want to write to HBase, the application hangs, with no log, no response, 
>> it just stays there, and nothing is written to HBase:
>> 
>> val model = 
>> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>>  TrainFeature())(seqOp, combOp).values.foreach({ res =>
>>   val configuration = HBaseConfiguration.create();
>>   configuration.set("hbase.zookeeper.property.clientPort", "2181");
>>   configuration.set("hbase.zookeeper.quorum", “192.168.1.66");
>>   configuration.set("hbase.master", "192.168.1:6");
>>   val hadmin = new HBaseAdmin(configuration);
>>   val table = new HTable(configuration, "ljh_test3");
>>   var put = new Put(Bytes.toBytes(res.toKey()));
>>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
>> Bytes.toBytes(res.totalCount + res.positiveCount));
>>   table.put(put);
>>   table.flushCommits()
>> })
>> 
>> And then I tried to write some simple data to HBase, and it worked well too:
>> 
>> sc.parallelize(Array(1,2,3,4)).foreach({ res =>
>> val configuration = HBaseConfiguration.create();
>> configuration.set("hbase.zookeeper.property.clientPort", "2181");
>> configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
>> configuration.set("hbase.master", "192.168.1:6");
>> val hadmin = new HBaseAdmin(configuration);
>> val table = new HTable(configuration, "ljh_test3");
>> var put = new Put(Bytes.toBytes(res));
>> put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes(res));
>> table.put(put);
>> table.flushCommits()
>> })
>> 
>> what is the problem with the 2nd code? thanks a lot.
>> 
> 


Re: spark to hbase

2015-10-27 Thread Ted Yu
Jinghong:
In one of the earlier threads on storing data to HBase, it was found that the
htrace jar was not on the classpath, leading to write failures.

Can you check whether you are facing the same problem ?

Cheers

On Tue, Oct 27, 2015 at 5:11 AM, Ted Yu  wrote:

> Jinghong:
> Hadmin variable is not used. You can omit that line.
>
> Which hbase release are you using ?
>
> As Deng said, don't flush per row.
>
> Cheers
>
> On Oct 27, 2015, at 3:21 AM, Deng Ching-Mallete  wrote:
>
> Hi,
>
> It would be more efficient if you configure the table and flush the
> commits by partition instead of per element in the RDD. The latter works
> fine because you only have 4 elements, but it won't bid well for large data
> sets IMO..
>
> Thanks,
> Deng
>
> On Tue, Oct 27, 2015 at 5:22 PM, jinhong lu  wrote:
>
>>
>> Hi,
>>
>> I write my result to hdfs, it did well:
>>
>> val model = 
>> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>>  TrainFeature())(seqOp, combOp).values
>>  model.map(a => (a.toKey() + "\t" + a.totalCount + "\t" + 
>> a.positiveCount)).saveAsTextFile(modelDataPath);
>>
>> But when I want to write to hbase, the applicaton hung, no log, no
>> response, just stay there, and nothing is written to hbase:
>>
>> val model = 
>> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>>  TrainFeature())(seqOp, combOp).values.foreach({ res =>
>>   val configuration = HBaseConfiguration.create();
>>   configuration.set("hbase.zookeeper.property.clientPort", "2181");
>>   configuration.set("hbase.zookeeper.quorum", “192.168.1.66");
>>   configuration.set("hbase.master", "192.168.1:6");
>>   val hadmin = new HBaseAdmin(configuration);
>>   val table = new HTable(configuration, "ljh_test3");
>>   var put = new Put(Bytes.toBytes(res.toKey()));
>>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
>> Bytes.toBytes(res.totalCount + res.positiveCount));
>>   table.put(put);
>>   table.flushCommits()
>> })
>>
>> And then I try to write som simple data to hbase, it did well too:
>>
>> sc.parallelize(Array(1,2,3,4)).foreach({ res =>
>> val configuration = HBaseConfiguration.create();
>> configuration.set("hbase.zookeeper.property.clientPort", "2181");
>> configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
>> configuration.set("hbase.master", "192.168.1:6");
>> val hadmin = new HBaseAdmin(configuration);
>> val table = new HTable(configuration, "ljh_test3");
>> var put = new Put(Bytes.toBytes(res));
>> put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes(res));
>> table.put(put);
>> table.flushCommits()
>> })
>>
>> what is the problem with the 2rd code? thanks a lot.
>>
>>


Re: spark to hbase

2015-10-27 Thread jinhong lu
Hi, Ted

thanks for your help.

I checked the jar; it is on the classpath, and now the problem is:

1. The following code runs well, and it puts the result into HBase:

  val res = 
lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
 TrainFeature())(seqOp, combOp).values.first()
 val configuration = HBaseConfiguration.create();
  configuration.set("hbase.zookeeper.property.clientPort", "2181");
  configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
  configuration.set("hbase.master", "192.168.1.66:6");
  val table = new HTable(configuration, "ljh_test3");
  var put = new Put(Bytes.toBytes(res.toKey()));
  put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
Bytes.toBytes(res.positiveCount));
  table.put(put);
  table.flushCommits()

2. But if I change the first() call to foreach:

  
lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
 TrainFeature())(seqOp, combOp).values.foreach({res=>
  val configuration = HBaseConfiguration.create();
  configuration.set("hbase.zookeeper.property.clientPort", "2181");
  configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
  configuration.set("hbase.master", "192.168.1.66:6");
  val table = new HTable(configuration, "ljh_test3");
  var put = new Put(Bytes.toBytes(res.toKey()));
  put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
Bytes.toBytes(res.positiveCount));
  table.put(put);

})

the application hangs, and the last log is:

15/10/28 09:30:33 INFO DAGScheduler: Missing parents for ResultStage 2: List()
15/10/28 09:30:33 INFO DAGScheduler: Submitting ResultStage 2 
(MapPartitionsRDD[6] at values at TrainModel3.scala:98), which is now runnable
15/10/28 09:30:33 INFO MemoryStore: ensureFreeSpace(7032) called with 
curMem=264045, maxMem=278302556
15/10/28 09:30:33 INFO MemoryStore: Block broadcast_3 stored as values in 
memory (estimated size 6.9 KB, free 265.2 MB)
15/10/28 09:30:33 INFO MemoryStore: ensureFreeSpace(3469) called with 
curMem=271077, maxMem=278302556
15/10/28 09:30:33 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in 
memory (estimated size 3.4 KB, free 265.1 MB)
15/10/28 09:30:33 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 
10.120.69.53:43019 (size: 3.4 KB, free: 265.4 MB)
15/10/28 09:30:33 INFO SparkContext: Created broadcast 3 from broadcast at 
DAGScheduler.scala:874
15/10/28 09:30:33 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 2 (MapPartitionsRDD[6] at values at TrainModel3.scala:98)
15/10/28 09:30:33 INFO YarnScheduler: Adding task set 2.0 with 1 tasks
15/10/28 09:30:33 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, 
gdc-dn147-formal.i.nease.net, PROCESS_LOCAL, 1716 bytes)
15/10/28 09:30:34 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 
gdc-dn147-formal.i.nease.net:59814 (size: 3.4 KB, free: 1060.3 MB)
15/10/28 09:30:34 INFO MapOutputTrackerMasterEndpoint: Asked to send map output 
locations for shuffle 0 to gdc-dn147-formal.i.nease.net:52904
15/10/28 09:30:34 INFO MapOutputTrackerMaster: Size of output statuses for 
shuffle 0 is 154 bytes

3. Besides, if I take the configuration and HTable out of the foreach:

val configuration = HBaseConfiguration.create();
configuration.set("hbase.zookeeper.property.clientPort", "2181");
configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
configuration.set("hbase.master", "192.168.1.66:6");
val table = new HTable(configuration, "ljh_test3");

lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
 TrainFeature())(seqOp, combOp).values.foreach({ res =>

  var put = new Put(Bytes.toBytes(res.toKey()));
  put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
Bytes.toBytes(res.positiveCount));
  table.put(put);

})
table.flushCommits()

I get a serialization problem:

Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD
$$anonfun$foreach$1.apply(RDD.scala:869)
at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:868)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:868)
at com.chencai.spark.ml.TrainModel3$.train(TrainModel3.scala:100)
at com.chencai.spark.ml.TrainModel3$.main(TrainModel3.scala:115)
at com.chencai.spark.ml.TrainModel3.main(TrainModel3.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 

Re: spark to hbase

2015-10-27 Thread Ted Yu
For #2, have you checked task log(s) to see if there was some clue ?

You may want to use foreachPartition to reduce the number of flushes.

In the future, please remove color coding - it is not easy to read.

Cheers

On Tue, Oct 27, 2015 at 6:53 PM, jinhong lu  wrote:

> Hi, Ted
>
> thanks for your help.
>
> I check the jar, it is in classpath, and now the problem is :
>
> 1、 Follow codes runs good, and it put the  result to hbse:
>
>   val res = 
> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>  TrainFeature())(seqOp, combOp).values.first()
>  val configuration = HBaseConfiguration.create();
>   configuration.set("hbase.zookeeper.property.clientPort", "2181");
>   configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
>   configuration.set("hbase.master", "192.168.1.66:6");
>   val table = new HTable(configuration, "ljh_test3");
>   var put = new Put(Bytes.toBytes(res.toKey()));
>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
> Bytes.toBytes(res.positiveCount));
>   table.put(put);
>   table.flushCommits()
>
> 2、But if I change the first() function to foreach:
>
>   
> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>  TrainFeature())(seqOp, combOp).values.foreach({res=>
>   val configuration = HBaseConfiguration.create();
>   configuration.set("hbase.zookeeper.property.clientPort", "2181");
>   configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
>   configuration.set("hbase.master", "192.168.1.66:6");
>   val table = new HTable(configuration, "ljh_test3");
>   var put = new Put(Bytes.toBytes(res.toKey()));
>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
> Bytes.toBytes(res.positiveCount));
>   table.put(put);
>
> })
>
> the application hung, and the last log is :
>
> 15/10/28 09:30:33 INFO DAGScheduler: Missing parents for ResultStage 2: List()
> 15/10/28 09:30:33 INFO DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[6] at values at TrainModel3.scala:98), which is now runnable
> 15/10/28 09:30:33 INFO MemoryStore: ensureFreeSpace(7032) called with 
> curMem=264045, maxMem=278302556
> 15/10/28 09:30:33 INFO MemoryStore: Block broadcast_3 stored as values in 
> memory (estimated size 6.9 KB, free 265.2 MB)
> 15/10/28 09:30:33 INFO MemoryStore: ensureFreeSpace(3469) called with 
> curMem=271077, maxMem=278302556
> 15/10/28 09:30:33 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes 
> in memory (estimated size 3.4 KB, free 265.1 MB)
> 15/10/28 09:30:33 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
> on 10.120.69.53:43019 (size: 3.4 KB, free: 265.4 MB)
> 15/10/28 09:30:33 INFO SparkContext: Created broadcast 3 from broadcast at 
> DAGScheduler.scala:874
> 15/10/28 09:30:33 INFO DAGScheduler: Submitting 1 missing tasks from 
> ResultStage 2 (MapPartitionsRDD[6] at values at TrainModel3.scala:98)
> 15/10/28 09:30:33 INFO YarnScheduler: Adding task set 2.0 with 1 tasks
> 15/10/28 09:30:33 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, 
> gdc-dn147-formal.i.nease.net, PROCESS_LOCAL, 1716 bytes)
> 15/10/28 09:30:34 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
> on gdc-dn147-formal.i.nease.net:59814 (size: 3.4 KB, free: 1060.3 MB)
> 15/10/28 09:30:34 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
> output locations for shuffle 0 to gdc-dn147-formal.i.nease.net:52904
> 15/10/28 09:30:34 INFO MapOutputTrackerMaster: Size of output statuses for 
> shuffle 0 is 154 bytes
>
> 3、besides, I take the configuration and HTable out of foreach:
>
> val configuration = HBaseConfiguration.create();
> configuration.set("hbase.zookeeper.property.clientPort", "2181");
> configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
> configuration.set("hbase.master", "192.168.1.66:6");
> val table = new HTable(configuration, "ljh_test3");
>
> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>  TrainFeature())(seqOp, combOp).values.foreach({ res =>
>
>   var put = new Put(Bytes.toBytes(res.toKey()));
>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
> Bytes.toBytes(res.positiveCount));
>   table.put(put);
>
> })
> table.flushCommits()
>
> found serializable problem:
>
> Exception in thread "main" org.apache.spark.SparkException: Task not 
> serializable
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
> at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
> at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
> at org.apache.spark.rdd.RDD
> $$anonfun$foreach$1.apply(RDD.scala:869)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:868)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
>  

Re: spark to hbase

2015-10-27 Thread jinhong lu
I wrote a demo, but still no response, no error, no log.

My hbase is 0.98, hadoop 2.3, spark 1.4.

And I run in yarn-client mode. Any ideas? Thanks.


package com.lujinhong.sparkdemo

import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

object SparkConnectHbase2 extends Serializable {

  def main(args: Array[String]) {
new SparkConnectHbase2().toHbase();
  }

}

class SparkConnectHbase2 extends Serializable {

  def toHbase() {
val conf = new SparkConf().setAppName("ljh_ml3");
val sc = new SparkContext(conf)

val tmp = sc.parallelize(Array(601, 701, 801, 901)).foreachPartition({ a => 
  val configuration = HBaseConfiguration.create();
  configuration.set("hbase.zookeeper.property.clientPort", "2181");
  configuration.set("hbase.zookeeper.quorum", “192.168.1.66");
  configuration.set("hbase.master", “192.168.1.66:6");
  val table = new HTable(configuration, "ljh_test4");
  var put = new Put(Bytes.toBytes(a+""));
  put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes(a + 
"value"));
  table.put(put);
  table.flushCommits();
})

  }

}


> On Oct 28, 2015, at 10:23, Fengdong Yu wrote:
> 
> Also, please move the HBase-related code out of the Scala object; this will 
> resolve the serialization issue and avoid opening the connection repeatedly.
> 
> And remember to close the table after the final flush.
> 
> 
> 
>> On Oct 28, 2015, at 10:13 AM, Ted Yu > > wrote:
>> 
>> For #2, have you checked task log(s) to see if there was some clue ?
>> 
>> You may want to use foreachPartition to reduce the number of flushes.
>> 
>> In the future, please remove color coding - it is not easy to read.
>> 
>> Cheers
>> 
>> On Tue, Oct 27, 2015 at 6:53 PM, jinhong lu > > wrote:
>> Hi, Ted
>> 
>> thanks for your help.
>> 
>> I check the jar, it is in classpath, and now the problem is :
>> 
>> 1、 Follow codes runs good, and it put the  result to hbse:
>> 
>>   val res = 
>> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>>  TrainFeature())(seqOp, combOp).values.first()
>>  val configuration = HBaseConfiguration.create();
>>   configuration.set("hbase.zookeeper.property.clientPort", "2181");
>>   configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
>>   configuration.set("hbase.master", "192.168.1.66:6 
>> ");
>>   val table = new HTable(configuration, "ljh_test3");
>>   var put = new Put(Bytes.toBytes(res.toKey()));
>>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
>> Bytes.toBytes(res.positiveCount));
>>   table.put(put);
>>   table.flushCommits()
>> 
>> 2、But if I change the first() function to foreach:
>> 
>>   
>> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>>  TrainFeature())(seqOp, combOp).values.foreach({res=>
>>   val configuration = HBaseConfiguration.create();
>>   configuration.set("hbase.zookeeper.property.clientPort", "2181");
>>   configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
>>   configuration.set("hbase.master", "192.168.1.66:6 
>> ");
>>   val table = new HTable(configuration, "ljh_test3");
>>   var put = new Put(Bytes.toBytes(res.toKey()));
>>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
>> Bytes.toBytes(res.positiveCount));
>>   table.put(put);
>> 
>> })
>> 
>> the application hung, and the last log is :
>> 
>> 15/10/28 09:30:33 INFO DAGScheduler: Missing parents for ResultStage 2: 
>> List()
>> 15/10/28 09:30:33 INFO DAGScheduler: Submitting ResultStage 2 
>> (MapPartitionsRDD[6] at values at TrainModel3.scala:98), which is now 
>> runnable
>> 15/10/28 09:30:33 INFO MemoryStore: ensureFreeSpace(7032) called with 
>> curMem=264045, maxMem=278302556
>> 15/10/28 09:30:33 INFO MemoryStore: Block broadcast_3 stored as values in 
>> memory (estimated size 6.9 KB, free 265.2 MB)
>> 15/10/28 09:30:33 INFO MemoryStore: ensureFreeSpace(3469) called with 
>> curMem=271077, maxMem=278302556
>> 15/10/28 09:30:33 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes 
>> in memory (estimated size 3.4 KB, free 265.1 MB)
>> 15/10/28 09:30:33 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
>> on 10.120.69.53:43019 (size: 3.4 KB, free: 265.4 MB)
>> 15/10/28 09:30:33 INFO SparkContext: Created broadcast 3 from broadcast at 
>> DAGScheduler.scala:874
>> 15/10/28 09:30:33 INFO DAGScheduler: Submitting 1 missing tasks from 
>> ResultStage 2 

Re: spark to hbase

2015-10-27 Thread Fengdong Yu
Also, please move the HBase-related code out of the Scala object; this will 
resolve the serialization issue and avoid opening the connection repeatedly.

And remember to close the table after the final flush.



> On Oct 28, 2015, at 10:13 AM, Ted Yu  wrote:
> 
> For #2, have you checked task log(s) to see if there was some clue ?
> 
> You may want to use foreachPartition to reduce the number of flushes.
> 
> In the future, please remove color coding - it is not easy to read.
> 
> Cheers
> 
> On Tue, Oct 27, 2015 at 6:53 PM, jinhong lu  > wrote:
> Hi, Ted
> 
> thanks for your help.
> 
> I check the jar, it is in classpath, and now the problem is :
> 
> 1、 Follow codes runs good, and it put the  result to hbse:
> 
>   val res = 
> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>  TrainFeature())(seqOp, combOp).values.first()
>  val configuration = HBaseConfiguration.create();
>   configuration.set("hbase.zookeeper.property.clientPort", "2181");
>   configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
>   configuration.set("hbase.master", "192.168.1.66:6 
> ");
>   val table = new HTable(configuration, "ljh_test3");
>   var put = new Put(Bytes.toBytes(res.toKey()));
>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
> Bytes.toBytes(res.positiveCount));
>   table.put(put);
>   table.flushCommits()
> 
> 2、But if I change the first() function to foreach:
> 
>   
> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>  TrainFeature())(seqOp, combOp).values.foreach({res=>
>   val configuration = HBaseConfiguration.create();
>   configuration.set("hbase.zookeeper.property.clientPort", "2181");
>   configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
>   configuration.set("hbase.master", "192.168.1.66:6 
> ");
>   val table = new HTable(configuration, "ljh_test3");
>   var put = new Put(Bytes.toBytes(res.toKey()));
>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
> Bytes.toBytes(res.positiveCount));
>   table.put(put);
> 
> })
> 
> the application hung, and the last log is :
> 
> 15/10/28 09:30:33 INFO DAGScheduler: Missing parents for ResultStage 2: List()
> 15/10/28 09:30:33 INFO DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[6] at values at TrainModel3.scala:98), which is now runnable
> 15/10/28 09:30:33 INFO MemoryStore: ensureFreeSpace(7032) called with 
> curMem=264045, maxMem=278302556
> 15/10/28 09:30:33 INFO MemoryStore: Block broadcast_3 stored as values in 
> memory (estimated size 6.9 KB, free 265.2 MB)
> 15/10/28 09:30:33 INFO MemoryStore: ensureFreeSpace(3469) called with 
> curMem=271077, maxMem=278302556
> 15/10/28 09:30:33 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes 
> in memory (estimated size 3.4 KB, free 265.1 MB)
> 15/10/28 09:30:33 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
> on 10.120.69.53:43019 (size: 3.4 KB, free: 265.4 MB)
> 15/10/28 09:30:33 INFO SparkContext: Created broadcast 3 from broadcast at 
> DAGScheduler.scala:874
> 15/10/28 09:30:33 INFO DAGScheduler: Submitting 1 missing tasks from 
> ResultStage 2 (MapPartitionsRDD[6] at values at TrainModel3.scala:98)
> 15/10/28 09:30:33 INFO YarnScheduler: Adding task set 2.0 with 1 tasks
> 15/10/28 09:30:33 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, 
> gdc-dn147-formal.i.nease.net , 
> PROCESS_LOCAL, 1716 bytes)
> 15/10/28 09:30:34 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
> on gdc-dn147-formal.i.nease.net:59814 (size: 3.4 KB, free: 1060.3 MB)
> 15/10/28 09:30:34 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
> output locations for shuffle 0 to gdc-dn147-formal.i.nease.net:52904
> 15/10/28 09:30:34 INFO MapOutputTrackerMaster: Size of output statuses for 
> shuffle 0 is 154 bytes
> 
> 3、besides, I take the configuration and HTable out of foreach:
> 
> val configuration = HBaseConfiguration.create();
> configuration.set("hbase.zookeeper.property.clientPort", "2181");
> configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
> configuration.set("hbase.master", "192.168.1.66:6");
> val table = new HTable(configuration, "ljh_test3");
> 
> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>  TrainFeature())(seqOp, combOp).values.foreach({ res =>
> 
>   var put = new Put(Bytes.toBytes(res.toKey()));
>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
> Bytes.toBytes(res.positiveCount));
>   table.put(put);
> 
> })
> table.flushCommits()
> 
> found serializable problem:
> 
> Exception in thread "main" org.apache.spark.SparkException: Task not 
> serializable
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
> at 
> 

Re: Spark and HBase join issue

2015-03-14 Thread Ted Yu
The 4.1 GB table has 3 regions. This means that there would be at least 2
nodes which don't carry its region.
Can you split this table into 12 (or more) regions ?

BTW what's the value for spark.yarn.executor.memoryOverhead ?
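
For reference, a minimal sketch of raising that overhead (the 2048 MB value is
purely illustrative, not a recommendation from this thread):

import org.apache.spark.{SparkConf, SparkContext}

// Raise the off-heap overhead YARN reserves per executor; shuffle-heavy
// HBase joins often need more than the default.
val conf = new SparkConf()
  .setAppName("hbase-left-outer-join")
  .set("spark.yarn.executor.memoryOverhead", "2048") // in MB for Spark 1.x

val sc = new SparkContext(conf)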

Cheers

On Sat, Mar 14, 2015 at 10:52 AM, francexo83 francex...@gmail.com wrote:

 Hi all,


 I have the following  cluster configurations:


- 5 nodes on a cloud environment.
- Hadoop 2.5.0.
- HBase 0.98.6.
- Spark 1.2.0.
- 8 cores and 16 GB of ram on each host.
- 1 NFS disk with 300 IOPS  mounted on host 1 and 2.
- 1 NFS disk with 300 IOPS  mounted on  host 3,4 and 5.

 I tried  to run  a spark job in cluster mode that computes the left outer
 join between two hbase tables.
 The first table  stores  about 4.1 GB of data spread across  3 regions
 with Snappy compression.
 The second one stores  about 1.2 GB of data spread across  22 regions with
 Snappy compression.

 I sometimes get executor lost failures during the shuffle phase of the last
 stage (saveAsHadoopDataset).

 Below my spark conf:

 num-cpu-cores = 20
 memory-per-node = 10G
 spark.scheduler.mode = FAIR
 spark.scheduler.pool = production
 spark.shuffle.spill= true
 spark.rdd.compress = true
 spark.core.connection.auth.wait.timeout=2000
 spark.sql.shuffle.partitions=100
 spark.default.parallelism=50
 spark.speculation=false
 spark.shuffle.spill=true
 spark.shuffle.memoryFraction=0.1
 spark.cores.max=30
 spark.driver.memory=10g

 Are the resources too low to handle this kind of operation?

 if yes, could you share with me the right configuration to perform this
 kind of task?

 Thank you in advance.

 F.







Re: Spark with HBase

2014-12-15 Thread Aniket Bhatnagar
In case you are still looking for help, there have been multiple discussions
in this mailing list that you can try searching for. Or you can simply use
https://github.com/unicredit/hbase-rdd :-)

Thanks,
Aniket

On Wed Dec 03 2014 at 16:11:47 Ted Yu yuzhih...@gmail.com wrote:

 Which hbase release are you running ?
 If it is 0.98, take a look at:

 https://issues.apache.org/jira/browse/SPARK-1297

 Thanks

 On Dec 2, 2014, at 10:21 PM, Jai jaidishhari...@gmail.com wrote:

 I am trying to use Apache Spark with a pseudo-distributed Hadoop HBase
 Cluster and I am looking for some links regarding the same. Can someone
 please guide me through the steps to accomplish this. Thanks a lot for
 Helping



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-HBase-tp20226.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark with HBase

2014-12-03 Thread Akhil Das
You could go through these to start with

http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase

http://stackoverflow.com/questions/25189527/how-to-process-a-range-of-hbase-rows-using-spark

Thanks
Best Regards

On Wed, Dec 3, 2014 at 11:51 AM, Jai jaidishhari...@gmail.com wrote:

 I am trying to use Apache Spark with a pseudo-distributed Hadoop HBase
 Cluster and I am looking for some links regarding the same. Can someone
 please guide me through the steps to accomplish this. Thanks a lot for
 Helping



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-HBase-tp20226.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark with HBase

2014-12-03 Thread Ted Yu
Which hbase release are you running ?
If it is 0.98, take a look at:

https://issues.apache.org/jira/browse/SPARK-1297

Thanks

On Dec 2, 2014, at 10:21 PM, Jai jaidishhari...@gmail.com wrote:

 I am trying to use Apache Spark with a pseudo-distributed Hadoop HBase
 Cluster and I am looking for some links regarding the same. Can someone
 please guide me through the steps to accomplish this. Thanks a lot for
 Helping
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-HBase-tp20226.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


Re: spark 1.1.0 - hbase 0.98.6-hadoop2 version - py4j.protocol.Py4JJavaError java.lang.ClassNotFoundException

2014-10-04 Thread Nick Pentreath
forgot to copy user list

On Sat, Oct 4, 2014 at 3:12 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 what version did you put in the pom.xml?

 it does seem to be in Maven central:
 http://search.maven.org/#artifactdetails%7Corg.apache.hbase%7Chbase%7C0.98.6-hadoop2%7Cpom

 <dependency>
   <groupId>org.apache.hbase</groupId>
   <artifactId>hbase</artifactId>
   <version>0.98.6-hadoop2</version>
 </dependency>

 Note you shouldn't need to rebuild Spark; I think you just need to rebuild the
 examples project via sbt examples/assembly.

 On Fri, Oct 3, 2014 at 10:55 AM, serkan.dogan foreignerdr...@yahoo.com
 wrote:

 Hi,
 I installed hbase-0.98.6-hadoop2. It's working; no problems with that.


 When I try to run the Spark HBase Python examples (the wordcount examples
 work, so it's not a Python issue)

  ./bin/spark-submit  --master local --driver-class-path
 ./examples/target/spark-examples_2.10-1.1.0.jar
 ./examples/src/main/python/hbase_inputformat.py localhost myhbasetable

 the process exits with a ClassNotFoundException...

 I searched lots of blogs and sites; all say Spark 1.1 was built with HBase
 0.94.6 and should be rebuilt with your own HBase version.

 First, I tried changing the hbase version number in pom.xml -- nothing was
 found in Maven Central.

 Second, I tried compiling hbase from source, copying the hbase jars from the
 hbase/lib folder to the spark/lib_managed folder, and editing
 spark-defaults.conf.

 my spark-defaults.conf

 spark.executor.extraClassPath

 /home/downloads/spark/spark-1.1.0/lib_managed/jars/hbase-server-0.98.6-hadoop2.jar:/home/downloads/spark/spark-1.1.0/lib_managed/jars/hbase-protocol-0.98.6-hadoop2.jar:/home/downloads/spark/spark-1.1.0/lib_managed/jars/hbase-hadoop2-compat-0.98.6-hadoop2.jar:/home/downloads/spark/spark-1.1.0/lib_managed/jars/hbase-client-0.98.6-hadoop2.jar:/home/downloads/spark/spark-1.1.0/lib_managed/jars/hbase-commont-0.98.6-hadoop2.jar:/home/downloads/spark/spark-1.1.0/lib_managed/jars/htrace-core-2.04.jar


 My question is: how can I use HBase 0.98.6-hadoop2 with Spark 1.1.0?

 Here is the exception message


 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 14/10/03 11:27:15 WARN Utils: Your hostname, xxx.yyy.com resolves to a
 loopback address: 127.0.0.1; using 1.1.1.1 instead (on interface eth0)
 14/10/03 11:27:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
 another address
 14/10/03 11:27:15 INFO SecurityManager: Changing view acls to: root,
 14/10/03 11:27:15 INFO SecurityManager: Changing modify acls to: root,
 14/10/03 11:27:15 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(root, );
 users
 with modify permissions: Set(root, )
 14/10/03 11:27:16 INFO Slf4jLogger: Slf4jLogger started
 14/10/03 11:27:16 INFO Remoting: Starting remoting
 14/10/03 11:27:16 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkdri...@1-1-1-1-1.rev.mydomain.io:49256]
 14/10/03 11:27:16 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://sparkdri...@1-1-1-1-1.rev.mydomain.io:49256]
 14/10/03 11:27:16 INFO Utils: Successfully started service 'sparkDriver'
 on
 port 49256.
 14/10/03 11:27:16 INFO SparkEnv: Registering MapOutputTracker
 14/10/03 11:27:16 INFO SparkEnv: Registering BlockManagerMaster
 14/10/03 11:27:16 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20141003112716-298d
 14/10/03 11:27:16 INFO Utils: Successfully started service 'Connection
 manager for block manager' on port 35106.
 14/10/03 11:27:16 INFO ConnectionManager: Bound socket to port 35106 with
 id
 = ConnectionManagerId(1-1-1-1-1.rev.mydomain.io,35106)
 14/10/03 11:27:16 INFO MemoryStore: MemoryStore started with capacity
 267.3
 MB
 14/10/03 11:27:16 INFO BlockManagerMaster: Trying to register BlockManager
 14/10/03 11:27:16 INFO BlockManagerMasterActor: Registering block manager
 1-1-1-1-1.rev.mydomain.io:35106 with 267.3 MB RAM
 14/10/03 11:27:16 INFO BlockManagerMaster: Registered BlockManager
 14/10/03 11:27:16 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-f60b0533-998f-4af2-a208-d04c571eab82
 14/10/03 11:27:16 INFO HttpServer: Starting HTTP Server
 14/10/03 11:27:16 INFO Utils: Successfully started service 'HTTP file
 server' on port 49611.
 14/10/03 11:27:16 INFO Utils: Successfully started service 'SparkUI' on
 port
 4040.
 14/10/03 11:27:16 INFO SparkUI: Started SparkUI at
 http://1-1-1-1-1.rev.mydomain.io:4040
 14/10/03 11:27:16 INFO Utils: Copying

 /home/downloads/spark/spark-1.1.0/./examples/src/main/python/hbase_inputformat.py
 to /tmp/spark-7232227a-0547-454e-9f68-805fa7b0c2f0/hbase_inputformat.py
 14/10/03 11:27:16 INFO SparkContext: Added file

 file:/home/downloads/spark/spark-1.1.0/./examples/src/main/python/hbase_inputformat.py
 at http://1.1.1.1:49611/files/hbase_inputformat.py with timestamp
 1412324836837
 14/10/03 11:27:16 INFO AkkaUtils: Connecting to HeartbeatReceiver:
 akka.tcp://
 sparkdri...@1-1-1-1-1.rev.mydomain.io:49256/user/HeartbeatReceiver
 Traceback (most 

Re: Spark with HBase

2014-08-07 Thread Akhil Das
You can download and compile spark against your existing hadoop version.

Here's a quick start
https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types

You can also read a bit here
http://docs.sigmoidanalytics.com/index.php/Installing_Spark_andSetting_Up_Your_Cluster
(the version is quite old)

Attached is a piece of Code (Spark Java API) to connect to HBase.



Thanks
Best Regards


On Thu, Aug 7, 2014 at 1:48 PM, Deepa Jayaveer deepa.jayav...@tcs.com
wrote:

 Hi
 I read your white paper about   . We wanted to do a proof of concept on
 Spark with HBase. Documentation on setting up a Spark cluster in a Hadoop 2
 environment is not easy to find. If you have any, can you please give us
 some reference URLs? Also, a sample program to connect to HBase using the
 Spark Java API would help.

 Thanks
 Deepa

 =-=-=
 Notice: The information contained in this e-mail
 message and/or attachments to it may contain
 confidential or privileged information. If you are
 not the intended recipient, any dissemination, use,
 review, distribution, printing or copying of the
 information contained in this e-mail message
 and/or attachments to it are strictly prohibited. If
 you have received this communication in error,
 please notify us by reply e-mail or telephone and
 immediately and permanently delete the message
 and any attachments. Thank you


import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.NewHadoopRDD;

import com.google.common.collect.Lists;

import scala.Tuple2;

public class SparkHBaseMain {

    @SuppressWarnings("deprecation")
    public static void main(String[] arg) {

        try {

            // Jars shipped to the executors; adjust paths/versions to your setup.
            List<String> jars = Lists.newArrayList(
                    "/home/akhld/Desktop/tools/spark-9/jars/spark-assembly-0.9.0-incubating-hadoop2.3.0-mr1-cdh5.0.0.jar",
                    "/home/akhld/Downloads/sparkhbasecode/hbase-server-0.96.0-hadoop2.jar",
                    "/home/akhld/Downloads/sparkhbasecode/hbase-protocol-0.96.0-hadoop2.jar",
                    "/home/akhld/Downloads/sparkhbasecode/hbase-hadoop2-compat-0.96.0-hadoop2.jar",
                    "/home/akhld/Downloads/sparkhbasecode/hbase-common-0.96.0-hadoop2.jar",
                    "/home/akhld/Downloads/sparkhbasecode/hbase-client-0.96.0-hadoop2.jar",
                    "/home/akhld/Downloads/sparkhbasecode/htrace-core-2.02.jar");

            SparkConf spconf = new SparkConf();
            spconf.setMaster("local");
            spconf.setAppName("SparkHBase");
            spconf.setSparkHome("/home/akhld/Desktop/tools/spark-9");
            spconf.setJars(jars.toArray(new String[jars.size()]));
            spconf.set("spark.executor.memory", "1g");

            final JavaSparkContext sc = new JavaSparkContext(spconf);

            // HBase configuration: point at hbase-site.xml and the table to scan.
            org.apache.hadoop.conf.Configuration conf = HBaseConfiguration.create();
            conf.addResource("/home/akhld/Downloads/sparkhbasecode/hbase-site.xml");
            conf.set(TableInputFormat.INPUT_TABLE, "blogposts");

            // One (row key, Result) pair per HBase row.
            NewHadoopRDD<ImmutableBytesWritable, Result> rdd =
                    new NewHadoopRDD<ImmutableBytesWritable, Result>(
                            JavaSparkContext.toSparkContext(sc),
                            TableInputFormat.class,
                            ImmutableBytesWritable.class,
                            Result.class,
                            conf);

            JavaRDD<Tuple2<ImmutableBytesWritable, Result>> jrdd = rdd.toJavaRDD();

            ForEachFunction f = new ForEachFunction();
            JavaRDD<Iterator<String>> retrdd = jrdd.map(f);
            System.out.println("Count = " + retrdd.count());

        } catch (Exception e) {

            e.printStackTrace();
            System.out.println("Crashed: " + e);

        }

    }

    @SuppressWarnings("serial")
    private static class ForEachFunction extends Function<Tuple2<ImmutableBytesWritable, Result>, Iterator<String>> {
        public Iterator<String> call(Tuple2<ImmutableBytesWritable, Result> test) {
            Result tmp = (Result) test._2;
            // Print every value of the post:title column for this row.
            List<KeyValue> kvl = tmp.getColumn("post".getBytes(), "title".getBytes());
            for (KeyValue kl : kvl) {
                String sb = new String(kl.getValue());
                System.out.println("Value: " + sb);
            }
            // Nothing useful to return; the printing above is the side effect we want.
            return null;
        }
    }

}

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark with HBase

2014-08-07 Thread chutium
These two posts should be good for setting up a Spark+HBase environment and
using the results of an HBase table scan as an RDD:

settings
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html

some samples:
http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html
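
Along the lines of the second post, here is a small sketch (untested) that
turns each scanned row into a row key plus a qualifier-to-value map; it
assumes an hbaseRDD of (ImmutableBytesWritable, Result) pairs obtained via
TableInputFormat as shown in those posts, and a placeholder column family
"cf":

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.util.Bytes

val rows = hbaseRDD.map { case (key, result) =>
  val rowKey = Bytes.toString(key.get())
  // Qualifier -> value for the "cf" column family (placeholder name).
  val cells = result.getFamilyMap(Bytes.toBytes("cf")).asScala.map {
    case (qualifier, value) => Bytes.toString(qualifier) -> Bytes.toString(value)
  }.toMap
  (rowKey, cells)
}
rows.take(5).foreach(println)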



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-HBase-tp11629p11647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark with HBase

2014-07-04 Thread N . Venkata Naga Ravi
Hi,

Any update on the solution? We are still facing this issue...
We are able to connect to HBase with standalone code, but we get this issue
with the Spark integration.

Thx,
Ravi

From: nvn_r...@hotmail.com
To: u...@spark.incubator.apache.org; user@spark.apache.org
Subject: RE: Spark with HBase
Date: Sun, 29 Jun 2014 15:32:42 +0530




+user@spark.apache.org

From: nvn_r...@hotmail.com
To: u...@spark.incubator.apache.org
Subject: Spark with HBase
Date: Sun, 29 Jun 2014 15:28:43 +0530




I am using the following versions:

spark-1.0.0-bin-hadoop2
hbase-0.96.1.1-hadoop2


When executing the HBase test, I am getting the following exception. It looks
like a version incompatibility; can you please help with it?

NERAVI-M-70HY:spark-1.0.0-bin-hadoop2 neravi$ ./bin/run-example 
org.apache.spark.examples.HBaseTest local localhost:4040 test



14/06/29 15:14:14 INFO RecoverableZooKeeper: The identifier of this process is 
69...@neravi-m-70hy.cisco.com
14/06/29 15:14:14 INFO ClientCnxn: Opening socket connection to server 
localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL 
(unknown error)
14/06/29 15:14:14 INFO ClientCnxn: Socket connection established to 
localhost/0:0:0:0:0:0:0:1:2181, initiating session
14/06/29 15:14:14 INFO ClientCnxn: Session establishment complete on server 
localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x146e6fa10750009, negotiated 
timeout = 4
Exception in thread main java.lang.IllegalArgumentException: Not a host:port 
pair: PBUF


192.168.1.6�(
at org.apache.hadoop.hbase.util.Addressing.parseHostname(Addressing.java:60)
at org.apache.hadoop.hbase.ServerName.<init>(ServerName.java:101)
at 
org.apache.hadoop.hbase.ServerName.parseVersionedServerName(ServerName.java:283)
at 
org.apache.hadoop.hbase.MasterAddressTracker.bytesToServerName(MasterAddressTracker.java:77)
at 
org.apache.hadoop.hbase.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:61)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:703)
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:126)
at org.apache.spark.examples.HBaseTest$.main(HBaseTest.scala:37)
at org.apache.spark.examples.HBaseTest.main(HBaseTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Thanks,
Ravi

  

RE: Spark with HBase

2014-06-29 Thread N . Venkata Naga Ravi
+user@spark.apache.org

From: nvn_r...@hotmail.com
To: u...@spark.incubator.apache.org
Subject: Spark with HBase
Date: Sun, 29 Jun 2014 15:28:43 +0530




I am using the following versions:

spark-1.0.0-bin-hadoop2
hbase-0.96.1.1-hadoop2


When executing the HBase test, I am getting the following exception. It looks
like a version incompatibility; can you please help with it?

NERAVI-M-70HY:spark-1.0.0-bin-hadoop2 neravi$ ./bin/run-example 
org.apache.spark.examples.HBaseTest local localhost:4040 test



14/06/29 15:14:14 INFO RecoverableZooKeeper: The identifier of this process is 
69...@neravi-m-70hy.cisco.com
14/06/29 15:14:14 INFO ClientCnxn: Opening socket connection to server 
localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL 
(unknown error)
14/06/29 15:14:14 INFO ClientCnxn: Socket connection established to 
localhost/0:0:0:0:0:0:0:1:2181, initiating session
14/06/29 15:14:14 INFO ClientCnxn: Session establishment complete on server 
localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x146e6fa10750009, negotiated 
timeout = 4
Exception in thread main java.lang.IllegalArgumentException: Not a host:port 
pair: PBUF


192.168.1.6�(
at org.apache.hadoop.hbase.util.Addressing.parseHostname(Addressing.java:60)
at org.apache.hadoop.hbase.ServerName.<init>(ServerName.java:101)
at 
org.apache.hadoop.hbase.ServerName.parseVersionedServerName(ServerName.java:283)
at 
org.apache.hadoop.hbase.MasterAddressTracker.bytesToServerName(MasterAddressTracker.java:77)
at 
org.apache.hadoop.hbase.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:61)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:703)
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:126)
at org.apache.spark.examples.HBaseTest$.main(HBaseTest.scala:37)
at org.apache.spark.examples.HBaseTest.main(HBaseTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Thanks,
Ravi

  

Re: Spark on HBase vs. Spark on HDFS

2014-05-23 Thread Mayur Rustagi
Also, I am unsure whether Spark on HBase leverages locality. When you cache
and process data, do you see NODE_LOCAL tasks in the process list?
Spark on HDFS leverages locality quite well and can really boost performance
by 3-4x in my experience.
If you are loading all your data from HBase into Spark, then you are better
off using HDFS.
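
If you want to see what locality Spark thinks it has for an HBase-backed
RDD, the following rough sketch (untested, assuming TableInputFormat and a
placeholder table name "mytable") prints the preferred host for each
partition; if those hosts match your worker hostnames you should see
NODE_LOCAL tasks in the UI:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseLocalityCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseLocalityCheck"))

    val conf = HBaseConfiguration.create()            // reads hbase-site.xml from the classpath
    conf.set(TableInputFormat.INPUT_TABLE, "mytable") // placeholder table name

    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // TableInputFormat creates one split per region, tagged with its region
    // server host, which is what Spark uses to schedule NODE_LOCAL tasks.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} preferred hosts: " +
        rdd.preferredLocations(p).mkString(", "))
    }
    sc.stop()
  }
}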

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Thu, May 22, 2014 at 1:09 PM, Nick Pentreath nick.pentre...@gmail.comwrote:

 Hi

 In my opinion, running HBase for immutable data is generally overkill in
 particular if you are using Shark anyway to cache and analyse the data and
 provide the speed.

 HBase is designed for random-access data patterns and high throughput R/W
 activities. If you are only ever writing immutable logs, then that is what
 HDFS is designed for.

 Having said that, if you replace HBase you will need to come up with a
 reliable way to put data into HDFS (a log aggregator like Flume or message
 bus like Kafka perhaps, etc), so the pain of doing that may not be worth it
 given you already know HBase.


 On Thu, May 22, 2014 at 9:33 AM, Limbeck, Philip 
 philip.limb...@automic.com wrote:

  HI!



 We are currently using HBase as our primary data store for various
 event-like data. On top of that, we use Shark to aggregate this data and
 keep it in memory for fast data access. Since we use no specific HBase
 functionality whatsoever except putting data into it, a discussion came up
 about whether we really need this additional set of components on top of
 HDFS instead of just writing to HDFS directly.

  Is there any overview regarding the implications of doing that? I mean
 apart from things like taking care of the file structure and the like. What
 is the true advantage of Spark on HBase over Spark on HDFS?



 Best

 Philip

 Automic Software GmbH, Hauptstrasse 3C, 3012 Wolfsgraben
 Firmenbuchnummer/Commercial Register No. 275184h
 Firmenbuchgericht/Commercial Register Court: Landesgericht St. Poelten

 This email (including any attachments) may contain information which is
 privileged, confidential, or protected. If you are not the intended
 recipient, note that any disclosure, copying, distribution, or use of the
 contents of this message and attached files is prohibited. If you have
 received this email in error, please notify the sender and delete this
 email and any attached files.





Re: Spark on HBase vs. Spark on HDFS

2014-05-22 Thread Nick Pentreath
Hi

In my opinion, running HBase for immutable data is generally overkill in
particular if you are using Shark anyway to cache and analyse the data and
provide the speed.

HBase is designed for random-access data patterns and high throughput R/W
activities. If you are only ever writing immutable logs, then that is what
HDFS is designed for.

Having said that, if you replace HBase you will need to come up with a
reliable way to put data into HDFS (a log aggregator like Flume or message
bus like Kafka perhaps, etc), so the pain of doing that may not be worth it
given you already know HBase.


On Thu, May 22, 2014 at 9:33 AM, Limbeck, Philip philip.limb...@automic.com
 wrote:

  HI!



 We are currently using HBase as our primary data store for various
 event-like data. On top of that, we use Shark to aggregate this data and
 keep it in memory for fast data access. Since we use no specific HBase
 functionality whatsoever except putting data into it, a discussion came up
 about whether we really need this additional set of components on top of
 HDFS instead of just writing to HDFS directly.

  Is there any overview regarding the implications of doing that? I mean
 apart from things like taking care of the file structure and the like. What
 is the true advantage of Spark on HBase over Spark on HDFS?



 Best

 Philip

 Automic Software GmbH, Hauptstrasse 3C, 3012 Wolfsgraben
 Firmenbuchnummer/Commercial Register No. 275184h
 Firmenbuchgericht/Commercial Register Court: Landesgericht St. Poelten

 This email (including any attachments) may contain information which is
 privileged, confidential, or protected. If you are not the intended
 recipient, note that any disclosure, copying, distribution, or use of the
 contents of this message and attached files is prohibited. If you have
 received this email in error, please notify the sender and delete this
 email and any attached files.




Re: Spark and HBase

2014-04-26 Thread Nicholas Chammas
Thank you for sharing. Phoenix for realtime queries and Spark for more
complex batch processing seems like a potentially good combo.

I wonder if Spark's future will include support for the same kinds of
workloads that Phoenix is being built for. This little tidbit about the
future of Spark SQL
(http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html)
seems to suggest just that (noting for others reading that Phoenix is
basically a SQL skin over HBase):

Look for future blog posts on the following topics:

- ...


- Reading and writing data using other formats and systems, include
Avro and HBase

 It would certainly be nice to have one big data framework to rule them all.

Nick



On Sat, Apr 26, 2014 at 10:00 AM, Josh Mahonin jmaho...@filetrek.comwrote:

 We're still in the infancy stages of the architecture for the project I'm
 on, but presently we're investigating HBase / Phoenix data store for it's
 realtime query abilities, and being able to expose data over a JDBC
 connector is attractive for us.

 Much of our data is event based, and many of the reports we'd like to do
 can be accomplished using simple SQL queries on that data - assuming they
 are performant. Thus far, the evidence shows that they are, even across
 many millions of rows.

 However, there are a number of models we have that today exist as a
 combination of PIG and python batch jobs that I'd like to replace with
 Spark, which thus far has shown to be more than adequate for what we're
 doing today.

 As far as using Phoenix as an endpoint for a batch load, the only real
 advantage I see over using straight HBase is that I can specify a query to
 prefilter the data before attaching it to an RDD. I haven't run the numbers
 yet to see how this compares to more traditional methods, though.

 The only worry I have is that the Phoenix input format doesn't adequately
 split the data across multiple nodes, so that's something I will need to
 look at further.

 Josh



 On Apr 25, 2014, at 6:33 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Josh, is there a specific use pattern you think is served well by Phoenix
 + Spark? Just curious.


 On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin jmaho...@filetrek.comwrote:

 Phoenix generally presents itself as an endpoint using JDBC, which in my
 testing seems to play nicely using JdbcRDD.

 However, a few days ago a patch was made against Phoenix to implement
 support via PIG using a custom Hadoop InputFormat, which means now it has
 Spark support too.

 Here's a code snippet that sets up an RDD for a specific query:

 --
 val phoenixConf = new PhoenixPigConfiguration(new Configuration())
 phoenixConf.setSelectStatement("SELECT EVENTTYPE, EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
 phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
 phoenixConf.configure("servername", "EVENTS", 100L)

 val phoenixRDD = sc.newAPIHadoopRDD(
   phoenixConf.getConfiguration(),
   classOf[PhoenixInputFormat],
   classOf[NullWritable],
   classOf[PhoenixRecord])
 --

 I'm still very new at Spark and even less experienced with Phoenix, but
 I'm hoping there's an advantage over the JdbcRDD in terms of partitioning.
 The JdbcRDD seems to implement partitioning based on a query predicate that
 is user defined, but I think Phoenix's InputFormat is able to figure out
 the splits which Spark is able to leverage. I don't really know how to
 verify if this is the case or not though, so if anyone else is looking into
 this, I'd love to hear their thoughts.

 Josh


 On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Just took a quick look at the overview here
 (http://phoenix.incubator.apache.org/) and the quick start guide here
 (http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html).

 It looks like Apache Phoenix aims to provide flexible SQL access to
 data, both for transactional and analytic purposes, and at interactive
 speeds.

 Nick


 On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang binwang...@gmail.com wrote:

 First, I have not tried it myself. However, from what I have heard it has
 some basic SQL features, so you can query your HBase table much like you
 query content on HDFS using Hive.
 So it is not just querying a simple column; I believe you can do joins and
 other SQL queries. Maybe you can set up an EMR cluster with HBase
 preconfigured and give it a try.

 Sorry cannot provide more detailed explanation and help.



 On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier 
 pomperma...@okkam.it wrote:

 Thanks for the quick reply Bin. Phoenix is something I'm going to try
 for sure, but it seems somewhat useless if I can use Spark.
 Probably, as you said, since Phoenix uses a dedicated data structure
 within each HBase table it has more effective memory usage, but if I need to
 deserialize data stored in an HBase cell I still have to read that object
 into memory, and thus I need Spark. From what I understood, Phoenix is 

Re: Spark and HBase

2014-04-25 Thread Josh Mahonin
Phoenix generally presents itself as an endpoint using JDBC, which in my
testing seems to play nicely using JdbcRDD.

However, a few days ago a patch was made against Phoenix to implement
support via PIG using a custom Hadoop InputFormat, which means now it has
Spark support too.

Here's a code snippet that sets up an RDD for a specific query:

--
val phoenixConf = new PhoenixPigConfiguration(new Configuration())
phoenixConf.setSelectStatement("SELECT EVENTTYPE, EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
phoenixConf.configure("servername", "EVENTS", 100L)

val phoenixRDD = sc.newAPIHadoopRDD(
  phoenixConf.getConfiguration(),
  classOf[PhoenixInputFormat],
  classOf[NullWritable],
  classOf[PhoenixRecord])
--

I'm still very new at Spark and even less experienced with Phoenix, but I'm
hoping there's an advantage over the JdbcRDD in terms of partitioning. The
JdbcRDD seems to implement partitioning based on a query predicate that is
user defined, but I think Phoenix's InputFormat is able to figure out the
splits which Spark is able to leverage. I don't really know how to verify
if this is the case or not though, so if anyone else is looking into this,
I'd love to hear their thoughts.
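
One rough way to check, reusing the phoenixRDD from the snippet above (just a
sketch, not tested against any particular Phoenix release), is to look at how
many partitions Spark created and where it would prefer to run them:

// Reuses phoenixRDD from the snippet above; names are illustrative only.
// More than one partition suggests the InputFormat is producing real splits,
// and the preferred locations should name the region server hosts.
println("partitions: " + phoenixRDD.partitions.length)
phoenixRDD.partitions.foreach { p =>
  println("partition " + p.index + " -> " +
    phoenixRDD.preferredLocations(p).mkString(", "))
}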

Josh


On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Just took a quick look at the overview here
 (http://phoenix.incubator.apache.org/) and the quick start guide here
 (http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html).

 It looks like Apache Phoenix aims to provide flexible SQL access to data,
 both for transactional and analytic purposes, and at interactive speeds.

 Nick


 On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang binwang...@gmail.com wrote:

 First, I have not tried it myself. However, from what I have heard it has
 some basic SQL features, so you can query your HBase table much like you
 query content on HDFS using Hive.
 So it is not just querying a simple column; I believe you can do joins and
 other SQL queries. Maybe you can set up an EMR cluster with HBase
 preconfigured and give it a try.

 Sorry cannot provide more detailed explanation and help.



 On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier pomperma...@okkam.it
  wrote:

 Thanks for the quick reply Bin. Phoenix is something I'm going to try for
 sure, but it seems somewhat useless if I can use Spark.
 Probably, as you said, since Phoenix uses a dedicated data structure within
 each HBase table it has more effective memory usage, but if I need to
 deserialize data stored in an HBase cell I still have to read that object
 into memory, and thus I need Spark. From what I understood, Phoenix is good
 if I have to query a simple column of HBase, but things get really
 complicated if I have to add an index for each column in my table and I
 store complex objects within the cells. Is that correct?

 Best,
 Flavio




 On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang binwang...@gmail.com wrote:

 Hi Flavio,

 I happened to attend, and am actually attending, the 2014 Apache Conf, where
 I heard about a project called Apache Phoenix, which fully leverages HBase
 and is supposed to be 1000x faster than Hive. And it is not memory-bounded,
 whereas memory does set a limit for Spark. It is still in the incubator, and
 the stats functions Spark has already implemented are still on its roadmap.
 I am not sure whether it will be good, but it might be something interesting
 to check out.

 /usr/bin


 On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier 
 pomperma...@okkam.it wrote:

 Hi to everybody,

  in these days I looked a bit at the recent evolution of the big data
 stacks and it seems that HBase is somehow fading away in favour of
 Spark+HDFS. Am I correct?
 Do you think that Spark and HBase should work together or not?

 Best regards,
 Flavio






Re: Spark and HBase

2014-04-25 Thread Nicholas Chammas
Josh, is there a specific use pattern you think is served well by Phoenix +
Spark? Just curious.


On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin jmaho...@filetrek.com wrote:

 Phoenix generally presents itself as an endpoint using JDBC, which in my
 testing seems to play nicely using JdbcRDD.

 However, a few days ago a patch was made against Phoenix to implement
 support via PIG using a custom Hadoop InputFormat, which means now it has
 Spark support too.

 Here's a code snippet that sets up an RDD for a specific query:

 --
 val phoenixConf = new PhoenixPigConfiguration(new Configuration())
 phoenixConf.setSelectStatement("SELECT EVENTTYPE, EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
 phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
 phoenixConf.configure("servername", "EVENTS", 100L)

 val phoenixRDD = sc.newAPIHadoopRDD(
   phoenixConf.getConfiguration(),
   classOf[PhoenixInputFormat],
   classOf[NullWritable],
   classOf[PhoenixRecord])
 --

 I'm still very new at Spark and even less experienced with Phoenix, but
 I'm hoping there's an advantage over the JdbcRDD in terms of partitioning.
 The JdbcRDD seems to implement partitioning based on a query predicate that
 is user defined, but I think Phoenix's InputFormat is able to figure out
 the splits which Spark is able to leverage. I don't really know how to
 verify if this is the case or not though, so if anyone else is looking into
 this, I'd love to hear their thoughts.

 Josh


 On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Just took a quick look at the overview here
 (http://phoenix.incubator.apache.org/) and the quick start guide here
 (http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html).

 It looks like Apache Phoenix aims to provide flexible SQL access to data,
 both for transactional and analytic purposes, and at interactive speeds.

 Nick


 On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang binwang...@gmail.com wrote:

 First, I have not tried it myself. However, from what I have heard it has
 some basic SQL features, so you can query your HBase table much like you
 query content on HDFS using Hive.
 So it is not just querying a simple column; I believe you can do joins and
 other SQL queries. Maybe you can set up an EMR cluster with HBase
 preconfigured and give it a try.

 Sorry cannot provide more detailed explanation and help.



 On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier 
 pomperma...@okkam.it wrote:

 Thanks for the quick reply Bin. Phoenix is something I'm going to try for
 sure, but it seems somewhat useless if I can use Spark.
 Probably, as you said, since Phoenix uses a dedicated data structure within
 each HBase table it has more effective memory usage, but if I need to
 deserialize data stored in an HBase cell I still have to read that object
 into memory, and thus I need Spark. From what I understood, Phoenix is good
 if I have to query a simple column of HBase, but things get really
 complicated if I have to add an index for each column in my table and I
 store complex objects within the cells. Is that correct?

 Best,
 Flavio




 On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang binwang...@gmail.com wrote:

 Hi Flavio,

 I happened to attend, and am actually attending, the 2014 Apache Conf, where
 I heard about a project called Apache Phoenix, which fully leverages HBase
 and is supposed to be 1000x faster than Hive. And it is not memory-bounded,
 whereas memory does set a limit for Spark. It is still in the incubator, and
 the stats functions Spark has already implemented are still on its roadmap.
 I am not sure whether it will be good, but it might be something interesting
 to check out.

 /usr/bin


 On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier 
 pomperma...@okkam.it wrote:

 Hi to everybody,

  in these days I looked a bit at the recent evolution of the big
 data stacks and it seems that HBase is somehow fading away in favour of
 Spark+HDFS. Am I correct?
 Do you think that Spark and HBase should work together or not?

 Best regards,
 Flavio







Re: Spark and HBase

2014-04-08 Thread Bin Wang
Hi Flavio,

I happened to attend, and am actually attending, the 2014 Apache Conf, where
I heard about a project called Apache Phoenix, which fully leverages HBase
and is supposed to be 1000x faster than Hive. And it is not memory-bounded,
whereas memory does set a limit for Spark. It is still in the incubator, and
the stats functions Spark has already implemented are still on its roadmap.
I am not sure whether it will be good, but it might be something interesting
to check out.

/usr/bin


On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier pomperma...@okkam.itwrote:

 Hi to everybody,

 in these days I looked a bit at the recent evolution of the big data
 stacks and it seems that HBase is somehow fading away in favour of
 Spark+HDFS. Am I correct?
 Do you think that Spark and HBase should work together or not?

 Best regards,
 Flavio



Re: Spark and HBase

2014-04-08 Thread Christopher Nguyen
Flavio, the two are best at two orthogonal use cases, HBase on the
transactional side, and Spark on the analytic side. Spark is not intended
for row-based random-access updates, while far more flexible and efficient
in dataset-scale aggregations and general computations.

So yes, you can easily see them deployed side-by-side in a given enterprise.

Sent while mobile. Pls excuse typos etc.
On Apr 8, 2014 5:58 AM, Flavio Pompermaier pomperma...@okkam.it wrote:

 Hi to everybody,

 in these days I looked a bit at the recent evolution of the big data
 stacks and it seems that HBase is somehow fading away in favour of
 Spark+HDFS. Am I correct?
 Do you think that Spark and HBase should work together or not?

 Best regards,
 Flavio



Re: Spark and HBase

2014-04-08 Thread Flavio Pompermaier
Thanks for the quick reply Bin. Phoenix is something I'm going to try for
sure, but it seems somewhat useless if I can use Spark.
Probably, as you said, since Phoenix uses a dedicated data structure within
each HBase table it has more effective memory usage, but if I need to
deserialize data stored in an HBase cell I still have to read that object
into memory, and thus I need Spark. From what I understood, Phoenix is good
if I have to query a simple column of HBase, but things get really
complicated if I have to add an index for each column in my table and I
store complex objects within the cells. Is that correct?

Best,
Flavio



On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang binwang...@gmail.com wrote:

 Hi Flavio,

 I happened to attend, and am actually attending, the 2014 Apache Conf, where
 I heard about a project called Apache Phoenix, which fully leverages HBase
 and is supposed to be 1000x faster than Hive. And it is not memory-bounded,
 whereas memory does set a limit for Spark. It is still in the incubator, and
 the stats functions Spark has already implemented are still on its roadmap.
 I am not sure whether it will be good, but it might be something interesting
 to check out.

 /usr/bin


 On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier 
 pomperma...@okkam.itwrote:

 Hi to everybody,

  in these days I looked a bit at the recent evolution of the big data
 stacks and it seems that HBase is somehow fading away in favour of
 Spark+HDFS. Am I correct?
 Do you think that Spark and HBase should work together or not?

 Best regards,
 Flavio




Re: Spark and HBase

2014-04-08 Thread Nicholas Chammas
Just took a quick look at the overview here
(http://phoenix.incubator.apache.org/) and the quick start guide here
(http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html).

It looks like Apache Phoenix aims to provide flexible SQL access to data,
both for transactional and analytic purposes, and at interactive speeds.

Nick


On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang binwang...@gmail.com wrote:

 First, I have not tried it myself. However, from what I have heard it has
 some basic SQL features, so you can query your HBase table much like you
 query content on HDFS using Hive.
 So it is not just querying a simple column; I believe you can do joins and
 other SQL queries. Maybe you can set up an EMR cluster with HBase
 preconfigured and give it a try.

 Sorry cannot provide more detailed explanation and help.



 On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier 
 pomperma...@okkam.itwrote:

 Thanks for the quick reply Bin. Phoenix is something I'm going to try for
 sure, but it seems somewhat useless if I can use Spark.
 Probably, as you said, since Phoenix uses a dedicated data structure within
 each HBase table it has more effective memory usage, but if I need to
 deserialize data stored in an HBase cell I still have to read that object
 into memory, and thus I need Spark. From what I understood, Phoenix is good
 if I have to query a simple column of HBase, but things get really
 complicated if I have to add an index for each column in my table and I
 store complex objects within the cells. Is that correct?

 Best,
 Flavio




 On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang binwang...@gmail.com wrote:

 Hi Flavio,

 I happened to attend, and am actually attending, the 2014 Apache Conf, where
 I heard about a project called Apache Phoenix, which fully leverages HBase
 and is supposed to be 1000x faster than Hive. And it is not memory-bounded,
 whereas memory does set a limit for Spark. It is still in the incubator, and
 the stats functions Spark has already implemented are still on its roadmap.
 I am not sure whether it will be good, but it might be something interesting
 to check out.

 /usr/bin


 On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier pomperma...@okkam.it
  wrote:

 Hi to everybody,

  in these days I looked a bit at the recent evolution of the big data
 stacks and it seems that HBase is somehow fading away in favour of
 Spark+HDFS. Am I correct?
 Do you think that Spark and HBase should work together or not?

 Best regards,
 Flavio