Re: Spark Hbase job taking long time

Amit Singh Hora Tue, 12 Aug 2014 12:03:37 -0700

Hi,
Today I created a table with 3 regions and 2 jobtrackers, but the
Spark job is still taking a lot of time.
I also noticed that the memory of the client was increasing
linearly. Is the Spark job first bringing the complete data into
memory?
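
For context on the region count: TableInputFormat normally produces one input split per region, so a table with 3 regions gives Spark at most 3 tasks reading in parallel. Below is a minimal sketch of that arithmetic, assuming rows are spread evenly across regions. The class and method names are hypothetical, the 1,000,000 figure is the 10 lakh rows from the quoted message, and the core count is an assumption (one core per worker), not something stated in the thread.

```java
final class RegionSplits {
    /** Upper bound on scan tasks running at once: one per region, capped by available cores. */
    static int parallelTasks(int regions, int totalCores) {
        return Math.min(regions, totalCores);
    }

    /** Rows each task scans if rows are evenly distributed across regions (rounded up). */
    static long rowsPerTask(long totalRows, int regions) {
        return (totalRows + regions - 1) / regions;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;  // 10 lakh rows, from the original post
        int assumedCores = 2;    // assumption: 2 workers with one core each

        for (int regions : new int[] {1, 3}) {
            System.out.println("regions=" + regions
                    + " parallelTasks=" + parallelTasks(regions, assumedCores)
                    + " rowsPerTask=" + rowsPerTask(rows, regions));
        }
    }
}
```

Under these assumptions, going from 1 region to 3 cuts the rows per task to roughly a third, but only as many region scans run concurrently as there are free cores.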



On Thu, Aug 7, 2014 at 7:31 PM, Ted Yu [via Apache Spark User List] <
ml-node+s1001560n11651...@n3.nabble.com> wrote:

> Forgot to include user@
>
> Another email from Amit indicated that there is 1 region in his table.
> This wouldn't give you the benefit TableInputFormat is expected to deliver.
>
> Please split your table into multiple regions.
>
> See http://hbase.apache.org/book.html#d3593e6847 and related links.
>
> Cheers
>
>
> On Wed, Aug 6, 2014 at 6:41 AM, Ted Yu <[hidden email]> wrote:
>
>> Can you try specifying some value (e.g. 100) for
>> "hbase.mapreduce.scan.cachedrows" in your conf?
>>
>> bq.  table contains 10lakh rows
>>
>> How many rows are there in the table?
>>
>> nit: Example uses classOf[TableInputFormat] instead of
>> TableInputFormat.class.
>>
>> Cheers
>>
>>
>> On Wed, Aug 6, 2014 at 5:54 AM, Amit Singh Hora <[hidden email]> wrote:
>>
>>> Hi All,
>>>
>>> I am trying to run a SQL query on HBase using a Spark job. So far I am
>>> able to get the desired results, but as the data set size increases the
>>> Spark job is taking a long time.
>>> I believe I am doing something wrong: after going through the
>>> documentation and videos discussing Spark performance, it should not
>>> take more than a couple of seconds.
>>>
>>> Please find the code snippet below.
>>> The HBase table contains 10 lakh (1,000,000) rows.
>>>
>>> JavaPairRDD<ImmutableBytesWritable, Result> pairRdd = ctx
>>>         .newAPIHadoopRDD(conf, TableInputFormat.class,
>>>                 ImmutableBytesWritable.class,
>>>                 org.apache.hadoop.hbase.client.Result.class)
>>>         .cache();
>>>
>>> JavaRDD<Person> people = pairRdd
>>>         .map(new Function<Tuple2<ImmutableBytesWritable, Result>, Person>() {
>>>             public Person call(Tuple2<ImmutableBytesWritable, Result> v1)
>>>                     throws Exception {
>>>                 System.out.println("comming");
>>>                 Person person = new Person();
>>>                 String key = Bytes.toString(v1._2.getRow());
>>>                 key = key.substring(0, key.lastIndexOf("_"));
>>>                 person.setCalling(Long.parseLong(key));
>>>                 person.setCalled(Bytes.toLong(v1._2.getValue(
>>>                         Bytes.toBytes("si"), Bytes.toBytes("called"))));
>>>                 person.setTime(Bytes.toLong(v1._2.getValue(
>>>                         Bytes.toBytes("si"), Bytes.toBytes("at"))));
>>>                 return person;
>>>             }
>>>         });
>>> JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
>>> schemaPeople.registerAsTable("people");
>>>
>>> // SQL can be run over RDDs that have been registered as tables.
>>> JavaSchemaRDD teenagers = sqlCtx
>>>         .sql("SELECT count(*) from people group by calling");
>>> teenagers.printSchema();
>>>
>>>
>>> I am running Spark using the start-all.sh script with 2 workers.
>>>
>>> Any pointers would be a great help.
>>> Regards,
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Hbase-job-taking-long-time-tp11541.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>
>
>
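
On the "hbase.mapreduce.scan.cachedrows" suggestion quoted above: scanner caching controls how many rows a scanner requests from the region server per RPC, so a small value means many round trips on a full-table scan. Below is a rough sketch of that effect, assuming every RPC returns a full batch and ignoring any size-based limits. The class and method names are hypothetical, and the effective default caching value depends on the HBase version in use.

```java
final class ScanCaching {
    /** Approximate number of scanner RPCs needed to read totalRows with the given caching value. */
    static long roundTrips(long totalRows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        // Ceiling division: a partially filled last batch still costs one RPC.
        return (totalRows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // 10 lakh rows, from the original post
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.println("caching=" + caching
                    + " -> about " + roundTrips(rows, caching) + " RPCs");
        }
    }
}
```

For 1,000,000 rows, a caching value of 100 works out to roughly 10,000 round trips, against 1,000,000 with a value of 1. Larger values trade fewer round trips for more memory held per scanner, so the suggested 100 is a starting point to tune from.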




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Hbase-job-taking-long-time-tp11541p11998.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
