Hi Rajeshwar Gaini,

dbtable can be any valid SQL query; simply define it as a subquery,
something like:


  val query = "(SELECT country, count(*) FROM customer GROUP BY country) AS X"

  val df1 = sqlContext.read
    .format("jdbc")
    .option("url", url)
    .option("user", username)
    .option("password", pwd)
    .option("driver", "driverClassNameHere")
    .option("dbtable", query)
    .load()
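
That subquery runs on the Postgres side, so only the per-country counts
come back over JDBC instead of the full 100M-row table.

If you ever do need to pull the whole table, you can also spread the read
across executors with the JDBC partitioning options. A rough sketch,
assuming customer_id is a numeric key; the bounds and partition count here
are placeholders you'd replace with your actual values:

  // Partitioned JDBC read: Spark issues numPartitions parallel queries,
  // each covering a slice of [lowerBound, upperBound] on partitionColumn.
  val df2 = sqlContext.read
    .format("jdbc")
    .option("url", url)
    .option("user", username)
    .option("password", pwd)
    .option("driver", "driverClassNameHere")
    .option("dbtable", "customer")
    .option("partitionColumn", "customer_id") // assumed numeric column
    .option("lowerBound", "1")                // placeholder min value
    .option("upperBound", "100000000")        // placeholder max value
    .option("numPartitions", "10")
    .load()

  df2.groupBy("country").count().show()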

Not sure if that's what you're looking for or not.

HTH.

-Todd

On Mon, Jan 11, 2016 at 3:47 AM, Gaini Rajeshwar <
raja.rajeshwar2...@gmail.com> wrote:

> There is no problem with the SQL read. When I do the following, it works
> fine.
>
> val dataframe1 = sqlContext.load("jdbc", Map("url" ->
>   "jdbc:postgresql://localhost/customerlogs?user=postgres&password=postgres",
>   "dbtable" -> "customer"))
>
> dataframe1.filter("country = 'BA'").show()
>
> On Mon, Jan 11, 2016 at 1:41 PM, Xingchi Wang <regrec...@gmail.com> wrote:
>
>> The error happened at "Lost task 0.0 in stage 0.0". I think it is not a
>> "groupBy" problem; it's an issue with the SQL read of the "customer"
>> table. Please check the JDBC link and whether the data is loaded
>> successfully.
>>
>> Thanks
>> Xingchi
>>
>> 2016-01-11 15:43 GMT+08:00 Gaini Rajeshwar <raja.rajeshwar2...@gmail.com>
>> :
>>
>>> Hi All,
>>>
>>> I have a table named *customer* (customer_id, event, country, ....) in a
>>> PostgreSQL database. This table has more than 100 million rows.
>>>
>>> I want to know the number of events from each country. To achieve that,
>>> I am doing a groupBy using Spark as follows.
>>>
>>> val dataframe1 = sqlContext.load("jdbc", Map("url" ->
>>>   "jdbc:postgresql://localhost/customerlogs?user=postgres&password=postgres",
>>>   "dbtable" -> "customer"))
>>>
>>> dataframe1.groupBy("country").count().show()
>>>
>>> The above code seems to fetch the complete customer table before doing
>>> the groupBy. Because of that, it throws the following error:
>>>
>>> 16/01/11 12:49:04 WARN HeartbeatReceiver: Removing executor 0 with no
>>> recent heartbeats: 170758 ms exceeds timeout 120000 ms
>>> 16/01/11 12:49:04 ERROR TaskSchedulerImpl: Lost executor 0 on
>>> 10.2.12.59: Executor heartbeat timed out after 170758 ms
>>> 16/01/11 12:49:04 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID
>>> 0, 10.2.12.59): ExecutorLostFailure (executor 0 exited caused by one of
>>> the running tasks) Reason: Executor heartbeat timed out after 170758 ms
>>>
>>> I am using Spark 1.6.0.
>>>
>>> Is there any way I can solve this?
>>>
>>> Thanks,
>>> Rajeshwar Gaini.
>>>
>>
>>
>
