[jira] [Resolved] (HAWQ-1438) Analyze report error: relcache reference xxx is not owned by resource owner TopTransaction

2017-07-13 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1438.
---
   Resolution: Fixed
Fix Version/s: (was: backlog)

> Analyze report error: relcache reference xxx is not owned by resource owner 
> TopTransaction
> --
>
> Key: HAWQ-1438
> URL: https://issues.apache.org/jira/browse/HAWQ-1438
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: 2.3.0.0-incubating
>
>
> 2017-04-12 14:23:13.866064 
> BST,"mis_ig","ig",p124811,th-224249568,"10.33.188.8","5172",2017-04-12 
> 14:20:42 
> BST,76687174,con61,cmd16,seg-1,,,x76687174,sx1,"ERROR","XX000","relcache 
> reference e_event_1_0_102_1_prt_2 is not owned by resource owner 
> TopTransaction (resowner.c:766)",,"ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102",0,,"resowner.c",766,"Stack trace:
> 1 0x8ce438 postgres errstart (elog.c:492)
> 2 0x8d01bb postgres elog_finish (elog.c:1443)
> 3 0x4ca5f4 postgres relation_close (heapam.c:1267)
> 4 0x5e7498 postgres analyzeStmt (analyze.c:728)
> 5 0x5e8a97 postgres analyzeStatement (analyze.c:274)
> 6 0x65c34c postgres vacuum (vacuum.c:319)
> 7 0x7f6172 postgres ProcessUtility (utility.c:1472)
> 8 0x7f1c3e postgres  (pquery.c:1974)
> 9 0x7f341e postgres  (pquery.c:2078)
> 10 0x7f5185 postgres PortalRun (pquery.c:1599)
> 11 0x7ee1f8 postgres PostgresMain (postgres.c:2782)
> 12 0x7a04f0 postgres  (postmaster.c:5486)
> 13 0x7a32b9 postgres PostmasterMain (postmaster.c:1459)
> 14 0x4a52b9 postgres main (main.c:226)
> 15 0x7fcaee7ded5d libc.so.6 __libc_start_main (??:0)
> 16 0x4a5339 postgres  (??:0)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HAWQ-1438) Analyze report error: relcache reference xxx is not owned by resource owner TopTransaction

2017-05-03 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1438:
-

Assignee: Ming LI  (was: Ed Espino)

> Analyze report error: relcache reference xxx is not owned by resource owner 
> TopTransaction
> --
>
> Key: HAWQ-1438
> URL: https://issues.apache.org/jira/browse/HAWQ-1438
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> 2017-04-12 14:23:13.866064 
> BST,"mis_ig","ig",p124811,th-224249568,"10.33.188.8","5172",2017-04-12 
> 14:20:42 
> BST,76687174,con61,cmd16,seg-1,,,x76687174,sx1,"ERROR","XX000","relcache 
> reference e_event_1_0_102_1_prt_2 is not owned by resource owner 
> TopTransaction (resowner.c:766)",,"ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102",0,,"resowner.c",766,"Stack trace:
> 1 0x8ce438 postgres errstart (elog.c:492)
> 2 0x8d01bb postgres elog_finish (elog.c:1443)
> 3 0x4ca5f4 postgres relation_close (heapam.c:1267)
> 4 0x5e7498 postgres analyzeStmt (analyze.c:728)
> 5 0x5e8a97 postgres analyzeStatement (analyze.c:274)
> 6 0x65c34c postgres vacuum (vacuum.c:319)
> 7 0x7f6172 postgres ProcessUtility (utility.c:1472)
> 8 0x7f1c3e postgres  (pquery.c:1974)
> 9 0x7f341e postgres  (pquery.c:2078)
> 10 0x7f5185 postgres PortalRun (pquery.c:1599)
> 11 0x7ee1f8 postgres PostgresMain (postgres.c:2782)
> 12 0x7a04f0 postgres  (postmaster.c:5486)
> 13 0x7a32b9 postgres PostmasterMain (postmaster.c:1459)
> 14 0x4a52b9 postgres main (main.c:226)
> 15 0x7fcaee7ded5d libc.so.6 __libc_start_main (??:0)
> 16 0x4a5339 postgres  (??:0)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (HAWQ-1453) relation_close() report error at analyzeStmt(): is not owned by resource owner TopTransaction (resowner.c:814)

2017-05-03 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1453:
-

Assignee: Ming LI  (was: Ed Espino)

> relation_close() report error at analyzeStmt(): is not owned by resource 
> owner TopTransaction (resowner.c:814)
> --
>
> Key: HAWQ-1453
> URL: https://issues.apache.org/jira/browse/HAWQ-1453
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> I created a simple MapReduce map-only program (to simulate a Spark executor, 
> like in the customer's environment) that uses JDBC through Postgresql Driver 
> (like the customer is doing) and I executed the queries the customer is 
> trying to execute. I can reproduce all the errors reported by customer.
> 2017-04-28 03:50:38.299276 
> IST,"gpadmin","gpadmin",p91745,th-609535712,"10.193.102.144","3228",2017-04-28
>  03:50:35 
> IST,156637,con4578,cmd36,seg-1,,,x156637,sx1,"ERROR","XX000","relcache 
> reference e_event_1_0_102_1_prt_2 is not owned by resource owner 
> TopTransaction (resowner.c:814)",,"ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102",0,,"resowner.c",814,"Stack trace:
> 1    0x8ce4a8 postgres errstart + 0x288
> 2    0x8d022b postgres elog_finish + 0xab
> 3    0x4ca654 postgres relation_close + 0x14
> 4    0x5e7508 postgres analyzeStmt + 0xd58
> 5    0x5e8b07 postgres analyzeStatement + 0x97
> 6    0x65c3bc postgres vacuum + 0x6c
> 7    0x7f61e2 postgres ProcessUtility + 0x542
> 8    0x7f1cae postgres  + 0x7f1cae
> 9    0x7f348e postgres  + 0x7f348e
> 10   0x7f51f5 postgres PortalRun + 0x465
> 11   0x7ee268 postgres PostgresMain + 0x1908
> 12   0x7a0560 postgres  + 0x7a0560
> 13   0x7a3329 postgres PostmasterMain + 0x759
> 14   0x4a5319 postgres main + 0x519
> 15   0x3a1661ed1d libc.so.6 __libc_start_main + 0xfd
> 16   0x4a5399 postgres  + 0x4a5399
> "



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HAWQ-1453) relation_close() report error at analyzeStmt(): is not owned by resource owner TopTransaction (resowner.c:814)

2017-05-03 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994580#comment-15994580
 ] 

Ming LI commented on HAWQ-1453:
---

Although we now support a resource owner beyond the transaction block, the 
relation was opened under TopResourceOwner but is closed under a new transaction 
resource owner. So before closing these relations, we first need to switch back 
to TopResourceOwner.
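
For illustration, a minimal C sketch of that close pattern, assuming a 
HAWQ-specific TopResourceOwner global and a hypothetical relation variable 
onerel; the actual patch may differ in detail:
{code}
/* Sketch only: release the relcache reference from the owner that took it.
 * relation_close() forgets the reference in CurrentResourceOwner, so if the
 * relation was opened while TopResourceOwner was current, switch back to it
 * for the close and then restore the transaction's owner. */
ResourceOwner save = CurrentResourceOwner;

CurrentResourceOwner = TopResourceOwner;   /* owner that opened the relation */
relation_close(onerel, NoLock);            /* reference forgotten in the right owner */
CurrentResourceOwner = save;               /* back to the transaction's owner */
{code}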

> relation_close() report error at analyzeStmt(): is not owned by resource 
> owner TopTransaction (resowner.c:814)
> --
>
> Key: HAWQ-1453
> URL: https://issues.apache.org/jira/browse/HAWQ-1453
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ed Espino
> Fix For: backlog
>
>
> I created a simple MapReduce map-only program (to simulate a Spark executor, 
> like in the customer's environment) that uses JDBC through Postgresql Driver 
> (like the customer is doing) and I executed the queries the customer is 
> trying to execute. I can reproduce all the errors reported by customer.
> 2017-04-28 03:50:38.299276 
> IST,"gpadmin","gpadmin",p91745,th-609535712,"10.193.102.144","3228",2017-04-28
>  03:50:35 
> IST,156637,con4578,cmd36,seg-1,,,x156637,sx1,"ERROR","XX000","relcache 
> reference e_event_1_0_102_1_prt_2 is not owned by resource owner 
> TopTransaction (resowner.c:814)",,"ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102",0,,"resowner.c",814,"Stack trace:
> 1    0x8ce4a8 postgres errstart + 0x288
> 2    0x8d022b postgres elog_finish + 0xab
> 3    0x4ca654 postgres relation_close + 0x14
> 4    0x5e7508 postgres analyzeStmt + 0xd58
> 5    0x5e8b07 postgres analyzeStatement + 0x97
> 6    0x65c3bc postgres vacuum + 0x6c
> 7    0x7f61e2 postgres ProcessUtility + 0x542
> 8    0x7f1cae postgres  + 0x7f1cae
> 9    0x7f348e postgres  + 0x7f348e
> 10   0x7f51f5 postgres PortalRun + 0x465
> 11   0x7ee268 postgres PostgresMain + 0x1908
> 12   0x7a0560 postgres  + 0x7a0560
> 13   0x7a3329 postgres PostmasterMain + 0x759
> 14   0x4a5319 postgres main + 0x519
> 15   0x3a1661ed1d libc.so.6 __libc_start_main + 0xfd
> 16   0x4a5399 postgres  + 0x4a5399
> "



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HAWQ-1453) relation_close() report error at analyzeStmt(): is not owned by resource owner TopTransaction (resowner.c:814)

2017-05-03 Thread Ming LI (JIRA)
Ming LI created HAWQ-1453:
-

 Summary: relation_close() report error at analyzeStmt(): is not 
owned by resource owner TopTransaction (resowner.c:814)
 Key: HAWQ-1453
 URL: https://issues.apache.org/jira/browse/HAWQ-1453
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Core
Reporter: Ming LI
Assignee: Ed Espino
 Fix For: backlog


I created a simple MapReduce map-only program (to simulate a Spark executor, as 
in the customer's environment) that uses JDBC through the PostgreSQL driver (as 
the customer does), and executed the queries the customer is trying to run. I 
can reproduce all the errors reported by the customer.
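
The reproduction program is not attached here; as a stand-in, below is a 
minimal libpq sketch in C (instead of the JDBC/MapReduce program described 
above) that drives the same ANALYZE from an ordinary client connection. The 
connection string is a placeholder.
{code}
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Placeholder connection parameters; adjust for the actual cluster. */
    PGconn   *conn = PQconnectdb("host=localhost port=5432 dbname=gpadmin");
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* The statement from the error log above. */
    res = PQexec(conn, "ANALYZE mis_data_ig_account_details.e_event_1_0_102");
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "ANALYZE failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}
{code}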

2017-04-28 03:50:38.299276 
IST,"gpadmin","gpadmin",p91745,th-609535712,"10.193.102.144","3228",2017-04-28 
03:50:35 IST,156637,con4578,cmd36,seg-1,,,x156637,sx1,"ERROR","XX000","relcache 
reference e_event_1_0_102_1_prt_2 is not owned by resource owner TopTransaction 
(resowner.c:814)",,"ANALYZE 
mis_data_ig_account_details.e_event_1_0_102",0,,"resowner.c",814,"Stack trace:
1    0x8ce4a8 postgres errstart + 0x288
2    0x8d022b postgres elog_finish + 0xab
3    0x4ca654 postgres relation_close + 0x14
4    0x5e7508 postgres analyzeStmt + 0xd58
5    0x5e8b07 postgres analyzeStatement + 0x97
6    0x65c3bc postgres vacuum + 0x6c
7    0x7f61e2 postgres ProcessUtility + 0x542
8    0x7f1cae postgres  + 0x7f1cae
9    0x7f348e postgres  + 0x7f348e
10   0x7f51f5 postgres PortalRun + 0x465
11   0x7ee268 postgres PostgresMain + 0x1908
12   0x7a0560 postgres  + 0x7a0560
13   0x7a3329 postgres PostmasterMain + 0x759
14   0x4a5319 postgres main + 0x519
15   0x3a1661ed1d libc.so.6 __libc_start_main + 0xfd
16   0x4a5399 postgres  + 0x4a5399
"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HAWQ-1448) Postmaster process hung at recv () on segment

2017-05-02 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992470#comment-15992470
 ] 

Ming LI commented on HAWQ-1448:
---

Instead of changing the connection type, we only change the hawq stop script, 
so that the change has minimal impact.

> Postmaster process hung at recv () on segment
> -
>
> Key: HAWQ-1448
> URL: https://issues.apache.org/jira/browse/HAWQ-1448
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Dispatcher
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> Some process hung for almost 2 hours before quit.
> 4/13/17 8:13:36 AM PDT: Thread 1 (Thread 0x7f9c78eae920 (LWP 177517)):
> 4/13/17 8:13:36 AM PDT: #0 0x00322180ec2c in recv () from 
> /lib64/libpthread.so.0
> 4/13/17 8:13:36 AM PDT: #1 0x007847e8 in secure_read ()
> 4/13/17 8:13:36 AM PDT: #2 0x00793735 in pq_recvbuf ()
> 4/13/17 8:13:36 AM PDT: #3 0x007939b9 in pq_getbyte ()
> 4/13/17 8:13:36 AM PDT: #4 0x008e39a4 in SocketBackend ()
> 4/13/17 8:13:36 AM PDT: #5 0x008e3ddc in ReadCommand ()
> 4/13/17 8:13:36 AM PDT: #6 0x008ea8c3 in PostgresMain ()
> 4/13/17 8:13:36 AM PDT: #7 0x008944ff in BackendRun ()
> 4/13/17 8:13:36 AM PDT: #8 0x0089391e in BackendStartup ()
> 4/13/17 8:13:36 AM PDT: #9 0x0088d99a in ServerLoop ()
> 4/13/17 8:13:36 AM PDT: #10 0x0088c9a7 in PostmasterMain ()
> 4/13/17 8:13:36 AM PDT: #11 0x007a9d63 in main ()
> 4/13/17 8:13:36 AM PDT: -
> All postgres processes on all host are quit,  only postmaster on seg3 hung.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (HAWQ-1448) Postmaster process hung at recv () on segment

2017-05-02 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992314#comment-15992314
 ] 

Ming LI edited comment on HAWQ-1448 at 5/2/17 7:16 AM:
---

Below is the related log, segmentdd/pg_log/hawq-2017-04-13_071837.csv, on seg3:
{code}
2017-04-13 08:08:15.998769 
PDT,,,p23303,th20286610240,,,seg-1,"LOG","0","received smart 
shutdown request",,,0,,"postmaster.c",3447,
...
2017-04-13 08:08:43.228325 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","FD 4 having 
errors raised. errno 111",,,0,,"rmcomm_AsyncComm.c",188,
2017-04-13 08:08:43.228347 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Resource 
manager socket connect has error raised.",,,0,,"rmcomm_Connect.c",100,
2017-04-13 08:08:43.228364 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Segment's 
resource manager sending IMAlive message switches from master to 
standby",,,0,,"rmcomm_RMSEG2RM.c",168,
2017-04-13 08:08:43.228383 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","segment will send 
heart-beat to standby from now on",,,0,,"resourcemanager_RMSEG.c",285,
2017-04-13 08:09:13.280237 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 127.0.0.1",,,0,,"network_utils.c",210,
2017-04-13 08:09:13.280294 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 10.32.34.6",,,0,,"network_utils.c",210,
... LOOP THESE 6 LINES  
 ...
2017-04-13 10:03:55.869252 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","FD 4 having 
errors raised. errno 111",,,0,,"rmcomm_AsyncComm.c",188,
2017-04-13 10:03:55.869277 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Resource 
manager socket connect has error raised.",,,0,,"rmcomm_Connect.c",100,
2017-04-13 10:03:55.869293 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Segment's 
resource manager sending IMAlive message switches from master to 
standby",,,0,,"rmcomm_RMSEG2RM.c",168,
2017-04-13 10:03:55.869323 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","segment will send 
heart-beat to standby from now on",,,0,,"resourcemanager_RMSEG.c",285,
2017-04-13 10:04:01.249461 
PDT,"hawqsuperuser","olap_winowerr",p177517,th2028661024,"10.32.35.251","45247",2017-04-13
 08:04:00 PDT,0,con4354,,seg6,"LOG","08006","could not receive data from 
client: Connection reset by peer",,,0,,"pqcomm.c",842,
2017-04-13 10:04:01.249522 
PDT,"hawqsuperuser","olap_winowerr",p177517,th2028661024,"10.32.35.251","45247",2017-04-13
 08:04:00 PDT,0,con4354,,seg6,"LOG","08P01","unexpected EOF on client 
connection",,,0,,"postgres.c",443,
2017-04-13 10:04:01.252964 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Segment RM 
exits.",,,0,,"resourcemanager.c",347,
2017-04-13 10:04:01.253027 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Clean up handler 
in message server is called.",,,0,,"rmcomm_MessageServer.c",105,
2017-04-13 10:04:01.255779 
PDT,,,p23308,th20286610240,,,seg-1,"LOG","0","shutting 
down",,,0,,"xlog.c",7861,
2017-04-13 10:04:01.257902 
PDT,,,p23308,th20286610240,,,seg-1,"LOG","0","database system 
is shut down",,,0,,"xlog.c",7882,
{code}




was (Author: mli):
Below is the related log
{code}
2017-04-13 08:08:15.998769 
PDT,,,p23303,th20286610240,,,seg-1,"LOG","0","received smart 
shutdown request",,,0,,"postmaster.c",3447,
...
2017-04-13 08:08:43.228325 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","FD 4 having 
errors raised. errno 111",,,0,,"rmcomm_AsyncComm.c",188,
2017-04-13 08:08:43.228347 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Resource 
manager socket connect has error raised.",,,0,,"rmcomm_Connect.c",100,
2017-04-13 08:08:43.228364 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Segment's 
resource manager sending IMAlive message switches from master to 
standby",,,0,,"rmcomm_RMSEG2RM.c",168,
2017-04-13 08:08:43.228383 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","segment will send 
heart-beat to standby from now on",,,0,,"resourcemanager_RMSEG.c",285,
2017-04-13 08:09:13.280237 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 127.0.0.1",,,0,,"network_utils.c",210,
2017-04-13 08:09:13.280294 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 10.32.34.6",,,0,,"network_utils.c",210,
... LOOP THESE 6 LINES 

[jira] [Updated] (HAWQ-1448) Postmaster process hung at recv () on segment

2017-05-02 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI updated HAWQ-1448:
--
Description: 
Some process hung for almost 2 hours before quitting.

4/13/17 8:13:36 AM PDT: Thread 1 (Thread 0x7f9c78eae920 (LWP 177517)):
4/13/17 8:13:36 AM PDT: #0 0x00322180ec2c in recv () from 
/lib64/libpthread.so.0
4/13/17 8:13:36 AM PDT: #1 0x007847e8 in secure_read ()
4/13/17 8:13:36 AM PDT: #2 0x00793735 in pq_recvbuf ()
4/13/17 8:13:36 AM PDT: #3 0x007939b9 in pq_getbyte ()
4/13/17 8:13:36 AM PDT: #4 0x008e39a4 in SocketBackend ()
4/13/17 8:13:36 AM PDT: #5 0x008e3ddc in ReadCommand ()
4/13/17 8:13:36 AM PDT: #6 0x008ea8c3 in PostgresMain ()
4/13/17 8:13:36 AM PDT: #7 0x008944ff in BackendRun ()
4/13/17 8:13:36 AM PDT: #8 0x0089391e in BackendStartup ()
4/13/17 8:13:36 AM PDT: #9 0x0088d99a in ServerLoop ()
4/13/17 8:13:36 AM PDT: #10 0x0088c9a7 in PostmasterMain ()
4/13/17 8:13:36 AM PDT: #11 0x007a9d63 in main ()
4/13/17 8:13:36 AM PDT: -

All postgres processes on all hosts have quit; only the postmaster on seg3 hung.

  was:
Some process hung for almost 2 hours before quit.

4/13/17 8:13:36 AM PDT: Thread 1 (Thread 0x7f9c78eae920 (LWP 177517)):
4/13/17 8:13:36 AM PDT: #0 0x00322180ec2c in recv () from 
/lib64/libpthread.so.0
4/13/17 8:13:36 AM PDT: #1 0x007847e8 in secure_read ()
4/13/17 8:13:36 AM PDT: #2 0x00793735 in pq_recvbuf ()
4/13/17 8:13:36 AM PDT: #3 0x007939b9 in pq_getbyte ()
4/13/17 8:13:36 AM PDT: #4 0x008e39a4 in SocketBackend ()
4/13/17 8:13:36 AM PDT: #5 0x008e3ddc in ReadCommand ()
4/13/17 8:13:36 AM PDT: #6 0x008ea8c3 in PostgresMain ()
4/13/17 8:13:36 AM PDT: #7 0x008944ff in BackendRun ()
4/13/17 8:13:36 AM PDT: #8 0x0089391e in BackendStartup ()
4/13/17 8:13:36 AM PDT: #9 0x0088d99a in ServerLoop ()
4/13/17 8:13:36 AM PDT: #10 0x0088c9a7 in PostmasterMain ()
4/13/17 8:13:36 AM PDT: #11 0x007a9d63 in main ()
4/13/17 8:13:36 AM PDT: -
All postgres processes on all host are quit, 


> Postmaster process hung at recv () on segment
> -
>
> Key: HAWQ-1448
> URL: https://issues.apache.org/jira/browse/HAWQ-1448
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Dispatcher
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> Some process hung for almost 2 hours before quit.
> 4/13/17 8:13:36 AM PDT: Thread 1 (Thread 0x7f9c78eae920 (LWP 177517)):
> 4/13/17 8:13:36 AM PDT: #0 0x00322180ec2c in recv () from 
> /lib64/libpthread.so.0
> 4/13/17 8:13:36 AM PDT: #1 0x007847e8 in secure_read ()
> 4/13/17 8:13:36 AM PDT: #2 0x00793735 in pq_recvbuf ()
> 4/13/17 8:13:36 AM PDT: #3 0x007939b9 in pq_getbyte ()
> 4/13/17 8:13:36 AM PDT: #4 0x008e39a4 in SocketBackend ()
> 4/13/17 8:13:36 AM PDT: #5 0x008e3ddc in ReadCommand ()
> 4/13/17 8:13:36 AM PDT: #6 0x008ea8c3 in PostgresMain ()
> 4/13/17 8:13:36 AM PDT: #7 0x008944ff in BackendRun ()
> 4/13/17 8:13:36 AM PDT: #8 0x0089391e in BackendStartup ()
> 4/13/17 8:13:36 AM PDT: #9 0x0088d99a in ServerLoop ()
> 4/13/17 8:13:36 AM PDT: #10 0x0088c9a7 in PostmasterMain ()
> 4/13/17 8:13:36 AM PDT: #11 0x007a9d63 in main ()
> 4/13/17 8:13:36 AM PDT: -
> All postgres processes on all host are quit,  only postmaster on seg3 hung.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (HAWQ-1448) Postmaster process hung at recv () on segment

2017-05-01 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1448:
-

Assignee: Ming LI  (was: Ed Espino)

> Postmaster process hung at recv () on segment
> -
>
> Key: HAWQ-1448
> URL: https://issues.apache.org/jira/browse/HAWQ-1448
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Dispatcher
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> Some process hung for almost 2 hours before quit.
> 4/13/17 8:13:36 AM PDT: Thread 1 (Thread 0x7f9c78eae920 (LWP 177517)):
> 4/13/17 8:13:36 AM PDT: #0 0x00322180ec2c in recv () from 
> /lib64/libpthread.so.0
> 4/13/17 8:13:36 AM PDT: #1 0x007847e8 in secure_read ()
> 4/13/17 8:13:36 AM PDT: #2 0x00793735 in pq_recvbuf ()
> 4/13/17 8:13:36 AM PDT: #3 0x007939b9 in pq_getbyte ()
> 4/13/17 8:13:36 AM PDT: #4 0x008e39a4 in SocketBackend ()
> 4/13/17 8:13:36 AM PDT: #5 0x008e3ddc in ReadCommand ()
> 4/13/17 8:13:36 AM PDT: #6 0x008ea8c3 in PostgresMain ()
> 4/13/17 8:13:36 AM PDT: #7 0x008944ff in BackendRun ()
> 4/13/17 8:13:36 AM PDT: #8 0x0089391e in BackendStartup ()
> 4/13/17 8:13:36 AM PDT: #9 0x0088d99a in ServerLoop ()
> 4/13/17 8:13:36 AM PDT: #10 0x0088c9a7 in PostmasterMain ()
> 4/13/17 8:13:36 AM PDT: #11 0x007a9d63 in main ()
> 4/13/17 8:13:36 AM PDT: -
> All postgres processes on all host are quit, 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HAWQ-1448) Postmaster process hung at recv () on segment

2017-05-01 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992315#comment-15992315
 ] 

Ming LI commented on HAWQ-1448:
---

The reason the terminate signal wasn't sent:
On a segment node, the connection from the master is regarded as a normal 
client connection, so during a normal shutdown pg_ctl cannot send the terminate 
signal to the postmaster until the libpq connection times out. It therefore 
waits a long time and then reports a keepalive connection timeout.
The solution is to make the connection between master and segment a non-normal 
client connection. This needs more investigation.
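
As background for the behaviour above, here is a hedged C sketch of postmaster 
shutdown-signal handling, modelled on upstream PostgreSQL rather than HAWQ's 
actual postmaster.c; it shows why a dispatcher session counted as a normal 
client keeps its backend blocked in recv() during a smart shutdown.
{code}
#include <signal.h>

/* Sketch only: smart vs. fast shutdown in the postmaster. */
typedef enum { NoShutdown, SmartShutdown, FastShutdown } ShutdownMode;

static ShutdownMode Shutdown = NoShutdown;

static void
SignalChildren(int sig)
{
    (void) sig;         /* stub: the real code signals every live backend */
}

static void
pmdie_sketch(int signo)
{
    if (signo == SIGTERM)
    {
        /* Smart shutdown: stop accepting new connections, but do NOT signal
         * backends serving normal client connections.  A dispatcher session
         * treated as a normal client therefore stays blocked in recv()
         * until the client side disconnects. */
        Shutdown = SmartShutdown;
    }
    else if (signo == SIGINT)
    {
        /* Fast shutdown: existing sessions are told to exit right away. */
        Shutdown = FastShutdown;
        SignalChildren(SIGTERM);
    }
}
{code}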

> Postmaster process hung at recv () on segment
> -
>
> Key: HAWQ-1448
> URL: https://issues.apache.org/jira/browse/HAWQ-1448
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Dispatcher
>Reporter: Ming LI
>Assignee: Ed Espino
> Fix For: backlog
>
>
> Some process hung for almost 2 hours before quit.
> 4/13/17 8:13:36 AM PDT: Thread 1 (Thread 0x7f9c78eae920 (LWP 177517)):
> 4/13/17 8:13:36 AM PDT: #0 0x00322180ec2c in recv () from 
> /lib64/libpthread.so.0
> 4/13/17 8:13:36 AM PDT: #1 0x007847e8 in secure_read ()
> 4/13/17 8:13:36 AM PDT: #2 0x00793735 in pq_recvbuf ()
> 4/13/17 8:13:36 AM PDT: #3 0x007939b9 in pq_getbyte ()
> 4/13/17 8:13:36 AM PDT: #4 0x008e39a4 in SocketBackend ()
> 4/13/17 8:13:36 AM PDT: #5 0x008e3ddc in ReadCommand ()
> 4/13/17 8:13:36 AM PDT: #6 0x008ea8c3 in PostgresMain ()
> 4/13/17 8:13:36 AM PDT: #7 0x008944ff in BackendRun ()
> 4/13/17 8:13:36 AM PDT: #8 0x0089391e in BackendStartup ()
> 4/13/17 8:13:36 AM PDT: #9 0x0088d99a in ServerLoop ()
> 4/13/17 8:13:36 AM PDT: #10 0x0088c9a7 in PostmasterMain ()
> 4/13/17 8:13:36 AM PDT: #11 0x007a9d63 in main ()
> 4/13/17 8:13:36 AM PDT: -
> All postgres processes on all host are quit, 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (HAWQ-1448) Postmaster process hung at recv () on segment

2017-05-01 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992314#comment-15992314
 ] 

Ming LI edited comment on HAWQ-1448 at 5/2/17 4:22 AM:
---

Below is the related log
{code}
2017-04-13 08:08:15.998769 
PDT,,,p23303,th20286610240,,,seg-1,"LOG","0","received smart 
shutdown request",,,0,,"postmaster.c",3447,
...
2017-04-13 08:08:43.228325 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","FD 4 having 
errors raised. errno 111",,,0,,"rmcomm_AsyncComm.c",188,
2017-04-13 08:08:43.228347 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Resource 
manager socket connect has error raised.",,,0,,"rmcomm_Connect.c",100,
2017-04-13 08:08:43.228364 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Segment's 
resource manager sending IMAlive message switches from master to 
standby",,,0,,"rmcomm_RMSEG2RM.c",168,
2017-04-13 08:08:43.228383 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","segment will send 
heart-beat to standby from now on",,,0,,"resourcemanager_RMSEG.c",285,
2017-04-13 08:09:13.280237 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 127.0.0.1",,,0,,"network_utils.c",210,
2017-04-13 08:09:13.280294 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 10.32.34.6",,,0,,"network_utils.c",210,
... LOOP THESE 6 LINES  
 ...
2017-04-13 10:03:55.869252 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","FD 4 having 
errors raised. errno 111",,,0,,"rmcomm_AsyncComm.c",188,
2017-04-13 10:03:55.869277 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Resource 
manager socket connect has error raised.",,,0,,"rmcomm_Connect.c",100,
2017-04-13 10:03:55.869293 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Segment's 
resource manager sending IMAlive message switches from master to 
standby",,,0,,"rmcomm_RMSEG2RM.c",168,
2017-04-13 10:03:55.869323 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","segment will send 
heart-beat to standby from now on",,,0,,"resourcemanager_RMSEG.c",285,
2017-04-13 10:04:01.249461 
PDT,"hawqsuperuser","olap_winowerr",p177517,th2028661024,"10.32.35.251","45247",2017-04-13
 08:04:00 PDT,0,con4354,,seg6,"LOG","08006","could not receive data from 
client: Connection reset by peer",,,0,,"pqcomm.c",842,
2017-04-13 10:04:01.249522 
PDT,"hawqsuperuser","olap_winowerr",p177517,th2028661024,"10.32.35.251","45247",2017-04-13
 08:04:00 PDT,0,con4354,,seg6,"LOG","08P01","unexpected EOF on client 
connection",,,0,,"postgres.c",443,
2017-04-13 10:04:01.252964 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Segment RM 
exits.",,,0,,"resourcemanager.c",347,
2017-04-13 10:04:01.253027 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Clean up handler 
in message server is called.",,,0,,"rmcomm_MessageServer.c",105,
2017-04-13 10:04:01.255779 
PDT,,,p23308,th20286610240,,,seg-1,"LOG","0","shutting 
down",,,0,,"xlog.c",7861,
2017-04-13 10:04:01.257902 
PDT,,,p23308,th20286610240,,,seg-1,"LOG","0","database system 
is shut down",,,0,,"xlog.c",7882,
{code}




was (Author: mli):
Below is the related log
```
2017-04-13 08:08:15.998769 
PDT,,,p23303,th20286610240,,,seg-1,"LOG","0","received smart 
shutdown request",,,0,,"postmaster.c",3447,
...
2017-04-13 08:08:43.228325 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","FD 4 having 
errors raised. errno 111",,,0,,"rmcomm_AsyncComm.c",188,
2017-04-13 08:08:43.228347 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Resource 
manager socket connect has error raised.",,,0,,"rmcomm_Connect.c",100,
2017-04-13 08:08:43.228364 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Segment's 
resource manager sending IMAlive message switches from master to 
standby",,,0,,"rmcomm_RMSEG2RM.c",168,
2017-04-13 08:08:43.228383 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","segment will send 
heart-beat to standby from now on",,,0,,"resourcemanager_RMSEG.c",285,
2017-04-13 08:09:13.280237 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 127.0.0.1",,,0,,"network_utils.c",210,
2017-04-13 08:09:13.280294 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 10.32.34.6",,,0,,"network_utils.c",210,
... LOOP THESE 6 LINES  
 

[jira] [Commented] (HAWQ-1448) Postmaster process hung at recv () on segment

2017-05-01 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992314#comment-15992314
 ] 

Ming LI commented on HAWQ-1448:
---

Below is the related log
```
2017-04-13 08:08:15.998769 
PDT,,,p23303,th20286610240,,,seg-1,"LOG","0","received smart 
shutdown request",,,0,,"postmaster.c",3447,
...
2017-04-13 08:08:43.228325 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","FD 4 having 
errors raised. errno 111",,,0,,"rmcomm_AsyncComm.c",188,
2017-04-13 08:08:43.228347 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Resource 
manager socket connect has error raised.",,,0,,"rmcomm_Connect.c",100,
2017-04-13 08:08:43.228364 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Segment's 
resource manager sending IMAlive message switches from master to 
standby",,,0,,"rmcomm_RMSEG2RM.c",168,
2017-04-13 08:08:43.228383 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","segment will send 
heart-beat to standby from now on",,,0,,"resourcemanager_RMSEG.c",285,
2017-04-13 08:09:13.280237 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 127.0.0.1",,,0,,"network_utils.c",210,
2017-04-13 08:09:13.280294 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 10.32.34.6",,,0,,"network_utils.c",210,
... LOOP THESE 6 LINES  
 ...
2017-04-13 10:03:55.869252 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","FD 4 having 
errors raised. errno 111",,,0,,"rmcomm_AsyncComm.c",188,
2017-04-13 10:03:55.869277 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Resource 
manager socket connect has error raised.",,,0,,"rmcomm_Connect.c",100,
2017-04-13 10:03:55.869293 
PDT,,,p23310,th20286610240,,,seg-1,"WARNING","01000","Segment's 
resource manager sending IMAlive message switches from master to 
standby",,,0,,"rmcomm_RMSEG2RM.c",168,
2017-04-13 10:03:55.869323 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","segment will send 
heart-beat to standby from now on",,,0,,"resourcemanager_RMSEG.c",285,
2017-04-13 10:04:01.249461 
PDT,"hawqsuperuser","olap_winowerr",p177517,th2028661024,"10.32.35.251","45247",2017-04-13
 08:04:00 PDT,0,con4354,,seg6,"LOG","08006","could not receive data from 
client: Connection reset by peer",,,0,,"pqcomm.c",842,
2017-04-13 10:04:01.249522 
PDT,"hawqsuperuser","olap_winowerr",p177517,th2028661024,"10.32.35.251","45247",2017-04-13
 08:04:00 PDT,0,con4354,,seg6,"LOG","08P01","unexpected EOF on client 
connection",,,0,,"postgres.c",443,
2017-04-13 10:04:01.252964 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Segment RM 
exits.",,,0,,"resourcemanager.c",347,
2017-04-13 10:04:01.253027 
PDT,,,p23310,th20286610240,,,seg-1,"LOG","0","Clean up handler 
in message server is called.",,,0,,"rmcomm_MessageServer.c",105,
2017-04-13 10:04:01.255779 
PDT,,,p23308,th20286610240,,,seg-1,"LOG","0","shutting 
down",,,0,,"xlog.c",7861,
2017-04-13 10:04:01.257902 
PDT,,,p23308,th20286610240,,,seg-1,"LOG","0","database system 
is shut down",,,0,,"xlog.c",7882,
```



> Postmaster process hung at recv () on segment
> -
>
> Key: HAWQ-1448
> URL: https://issues.apache.org/jira/browse/HAWQ-1448
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Dispatcher
>Reporter: Ming LI
>Assignee: Ed Espino
> Fix For: backlog
>
>
> Some process hung for almost 2 hours before quit.
> 4/13/17 8:13:36 AM PDT: Thread 1 (Thread 0x7f9c78eae920 (LWP 177517)):
> 4/13/17 8:13:36 AM PDT: #0 0x00322180ec2c in recv () from 
> /lib64/libpthread.so.0
> 4/13/17 8:13:36 AM PDT: #1 0x007847e8 in secure_read ()
> 4/13/17 8:13:36 AM PDT: #2 0x00793735 in pq_recvbuf ()
> 4/13/17 8:13:36 AM PDT: #3 0x007939b9 in pq_getbyte ()
> 4/13/17 8:13:36 AM PDT: #4 0x008e39a4 in SocketBackend ()
> 4/13/17 8:13:36 AM PDT: #5 0x008e3ddc in ReadCommand ()
> 4/13/17 8:13:36 AM PDT: #6 0x008ea8c3 in PostgresMain ()
> 4/13/17 8:13:36 AM PDT: #7 0x008944ff in BackendRun ()
> 4/13/17 8:13:36 AM PDT: #8 0x0089391e in BackendStartup ()
> 4/13/17 8:13:36 AM PDT: #9 0x0088d99a in ServerLoop ()
> 4/13/17 8:13:36 AM PDT: #10 0x0088c9a7 in PostmasterMain ()
> 4/13/17 8:13:36 AM PDT: #11 0x007a9d63 in main ()
> 4/13/17 8:13:36 AM PDT: -
> All postgres processes on all host are quit, 



--
This message was sent by Atlassian JIRA

[jira] [Created] (HAWQ-1448) Postmaster process hung at recv () on segment

2017-05-01 Thread Ming LI (JIRA)
Ming LI created HAWQ-1448:
-

 Summary: Postmaster process hung at recv () on segment
 Key: HAWQ-1448
 URL: https://issues.apache.org/jira/browse/HAWQ-1448
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Dispatcher
Reporter: Ming LI
Assignee: Ed Espino
 Fix For: backlog


Some process hung for almost 2 hours before quitting.

4/13/17 8:13:36 AM PDT: Thread 1 (Thread 0x7f9c78eae920 (LWP 177517)):
4/13/17 8:13:36 AM PDT: #0 0x00322180ec2c in recv () from 
/lib64/libpthread.so.0
4/13/17 8:13:36 AM PDT: #1 0x007847e8 in secure_read ()
4/13/17 8:13:36 AM PDT: #2 0x00793735 in pq_recvbuf ()
4/13/17 8:13:36 AM PDT: #3 0x007939b9 in pq_getbyte ()
4/13/17 8:13:36 AM PDT: #4 0x008e39a4 in SocketBackend ()
4/13/17 8:13:36 AM PDT: #5 0x008e3ddc in ReadCommand ()
4/13/17 8:13:36 AM PDT: #6 0x008ea8c3 in PostgresMain ()
4/13/17 8:13:36 AM PDT: #7 0x008944ff in BackendRun ()
4/13/17 8:13:36 AM PDT: #8 0x0089391e in BackendStartup ()
4/13/17 8:13:36 AM PDT: #9 0x0088d99a in ServerLoop ()
4/13/17 8:13:36 AM PDT: #10 0x0088c9a7 in PostmasterMain ()
4/13/17 8:13:36 AM PDT: #11 0x007a9d63 in main ()
4/13/17 8:13:36 AM PDT: -
All postgres processes on all host are quit, 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (HAWQ-1438) Analyze report error: relcache reference xxx is not owned by resource owner TopTransaction

2017-04-21 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976843#comment-15976843
 ] 

Ming LI edited comment on HAWQ-1438 at 4/21/17 7:26 AM:


The PR abstract:
(1) Add TopResourceOwner to track resources used beyond the transaction boundary.
(2) When is a new TopResourceOwner automatically allocated for each process?
A singleton TopResourceOwner is generated when CurrentResourceOwner is set to 
NULL.
(3) If the crash occurs in code beyond the transaction boundary, we should 
manually create a new resource owner, because we don't manually create 
TopResourceOwner for each process.
(4) When is ResourceOwnerDelete(TopResourceOwner) called automatically for each 
process?
It is registered at on_proc_exit().
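
A hedged C sketch of points (2) and (4), using the stock ResourceOwnerCreate, 
ResourceOwnerDelete and on_proc_exit() APIs; the names follow this comment, not 
the actual patch.
{code}
#include "postgres.h"
#include "storage/ipc.h"        /* on_proc_exit() */
#include "utils/resowner.h"     /* ResourceOwnerCreate()/ResourceOwnerDelete() */

/* Sketch only. */
static ResourceOwner TopResourceOwner = NULL;

static void
DeleteTopResourceOwner(int code, Datum arg)
{
    /* Point (4): registered at on_proc_exit(); the real code would release
     * the owner's remaining resources before deleting it. */
    if (TopResourceOwner != NULL)
        ResourceOwnerDelete(TopResourceOwner);
}

/* Point (2): wherever CurrentResourceOwner would otherwise become NULL,
 * fall back to a lazily created per-process singleton instead. */
ResourceOwner
GetTopResourceOwner(void)
{
    if (TopResourceOwner == NULL)
    {
        TopResourceOwner = ResourceOwnerCreate(NULL, "TopResourceOwner");
        on_proc_exit(DeleteTopResourceOwner, 0);
    }
    return TopResourceOwner;
}
{code}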


was (Author: mli):
The PR abstract:
(1) Add TopResourceOwner to trace resource used beyond the transaction boundary.
(2) When to auto alloc a new TopResourceOwner for each process?
Generate a singleton TopResourceOwner when setting CurrentResourceOwner to NULL.
(3) If crashed at some code beyond the transaction boundary, we should manually 
create
a new resource owner, because we don't manually create a new one for each 
process.
(4) When to call ResourceOwnerDelete(TopResourceOwner) automatically for each 
process?
Register it at on_proc_exit().

> Analyze report error: relcache reference xxx is not owned by resource owner 
> TopTransaction
> --
>
> Key: HAWQ-1438
> URL: https://issues.apache.org/jira/browse/HAWQ-1438
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ed Espino
> Fix For: backlog
>
>
> 2017-04-12 14:23:13.866064 
> BST,"mis_ig","ig",p124811,th-224249568,"10.33.188.8","5172",2017-04-12 
> 14:20:42 
> BST,76687174,con61,cmd16,seg-1,,,x76687174,sx1,"ERROR","XX000","relcache 
> reference e_event_1_0_102_1_prt_2 is not owned by resource owner 
> TopTransaction (resowner.c:766)",,"ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102",0,,"resowner.c",766,"Stack trace:
> 1 0x8ce438 postgres errstart (elog.c:492)
> 2 0x8d01bb postgres elog_finish (elog.c:1443)
> 3 0x4ca5f4 postgres relation_close (heapam.c:1267)
> 4 0x5e7498 postgres analyzeStmt (analyze.c:728)
> 5 0x5e8a97 postgres analyzeStatement (analyze.c:274)
> 6 0x65c34c postgres vacuum (vacuum.c:319)
> 7 0x7f6172 postgres ProcessUtility (utility.c:1472)
> 8 0x7f1c3e postgres  (pquery.c:1974)
> 9 0x7f341e postgres  (pquery.c:2078)
> 10 0x7f5185 postgres PortalRun (pquery.c:1599)
> 11 0x7ee1f8 postgres PostgresMain (postgres.c:2782)
> 12 0x7a04f0 postgres  (postmaster.c:5486)
> 13 0x7a32b9 postgres PostmasterMain (postmaster.c:1459)
> 14 0x4a52b9 postgres main (main.c:226)
> 15 0x7fcaee7ded5d libc.so.6 __libc_start_main (??:0)
> 16 0x4a5339 postgres  (??:0)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (HAWQ-1438) Analyze report error: relcache reference xxx is not owned by resource owner TopTransaction

2017-04-21 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976843#comment-15976843
 ] 

Ming LI edited comment on HAWQ-1438 at 4/21/17 7:24 AM:


The PR abstract:
(1) Add TopResourceOwner to track resources used beyond the transaction boundary.
(2) When is a new TopResourceOwner automatically allocated for each process?
A singleton TopResourceOwner is generated when CurrentResourceOwner is set to 
NULL.
(3) If the crash occurs in code beyond the transaction boundary, we should 
manually create a new resource owner, because we don't manually create a new 
one for each process.
(4) When is ResourceOwnerDelete(TopResourceOwner) called automatically for each 
process?
It is registered at on_proc_exit().


was (Author: mli):
The PR abstract:
(1) Add TopResourceOwner to trace resource used beyond the transaction boundary.
(2) When to auto alloc a new TopResourceOwner for each process?
Generate a singleton TopResourceOwner when setting CurrentResourceOwner to NULL.
(3) If crashed at some code beyond the transaction boundary, we should manually 
create
a new resource owner, because we do manually create a new one for each process.
(4) When to call ResourceOwnerDelete(TopResourceOwner) automatically for each 
process?
Register it at on_proc_exit().

> Analyze report error: relcache reference xxx is not owned by resource owner 
> TopTransaction
> --
>
> Key: HAWQ-1438
> URL: https://issues.apache.org/jira/browse/HAWQ-1438
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ed Espino
> Fix For: backlog
>
>
> 2017-04-12 14:23:13.866064 
> BST,"mis_ig","ig",p124811,th-224249568,"10.33.188.8","5172",2017-04-12 
> 14:20:42 
> BST,76687174,con61,cmd16,seg-1,,,x76687174,sx1,"ERROR","XX000","relcache 
> reference e_event_1_0_102_1_prt_2 is not owned by resource owner 
> TopTransaction (resowner.c:766)",,"ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102",0,,"resowner.c",766,"Stack trace:
> 1 0x8ce438 postgres errstart (elog.c:492)
> 2 0x8d01bb postgres elog_finish (elog.c:1443)
> 3 0x4ca5f4 postgres relation_close (heapam.c:1267)
> 4 0x5e7498 postgres analyzeStmt (analyze.c:728)
> 5 0x5e8a97 postgres analyzeStatement (analyze.c:274)
> 6 0x65c34c postgres vacuum (vacuum.c:319)
> 7 0x7f6172 postgres ProcessUtility (utility.c:1472)
> 8 0x7f1c3e postgres  (pquery.c:1974)
> 9 0x7f341e postgres  (pquery.c:2078)
> 10 0x7f5185 postgres PortalRun (pquery.c:1599)
> 11 0x7ee1f8 postgres PostgresMain (postgres.c:2782)
> 12 0x7a04f0 postgres  (postmaster.c:5486)
> 13 0x7a32b9 postgres PostmasterMain (postmaster.c:1459)
> 14 0x4a52b9 postgres main (main.c:226)
> 15 0x7fcaee7ded5d libc.so.6 __libc_start_main (??:0)
> 16 0x4a5339 postgres  (??:0)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (HAWQ-1438) Analyze report error: relcache reference xxx is not owned by resource owner TopTransaction

2017-04-20 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976843#comment-15976843
 ] 

Ming LI edited comment on HAWQ-1438 at 4/20/17 2:59 PM:


The PR abstract:
(1) Add TopResourceOwner to track resources used beyond the transaction boundary.
(2) When is a new TopResourceOwner automatically allocated for each process?
A singleton TopResourceOwner is generated when CurrentResourceOwner is set to 
NULL.
(3) If the crash occurs in code beyond the transaction boundary, we should 
manually create a new resource owner, because we do manually create a new one 
for each process.
(4) When is ResourceOwnerDelete(TopResourceOwner) called automatically for each 
process?
It is registered at on_proc_exit().


was (Author: mli):
The PR introduction:
(1) Add TopResourceOwner to trace resource used beyond the transaction boundary.
(2) When to auto alloc a new TopResourceOwner for each process?
Generate a singleton TopResourceOwner when setting CurrentResourceOwner to NULL.
(3) If crashed at some code beyond the transaction boundary, we should manually 
create
a new resource owner, because we do manually create a new one for each process.
(4) When to call ResourceOwnerDelete(TopResourceOwner) automatically for each 
process?
Register it at on_proc_exit().

> Analyze report error: relcache reference xxx is not owned by resource owner 
> TopTransaction
> --
>
> Key: HAWQ-1438
> URL: https://issues.apache.org/jira/browse/HAWQ-1438
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ed Espino
> Fix For: backlog
>
>
> 2017-04-12 14:23:13.866064 
> BST,"mis_ig","ig",p124811,th-224249568,"10.33.188.8","5172",2017-04-12 
> 14:20:42 
> BST,76687174,con61,cmd16,seg-1,,,x76687174,sx1,"ERROR","XX000","relcache 
> reference e_event_1_0_102_1_prt_2 is not owned by resource owner 
> TopTransaction (resowner.c:766)",,"ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102",0,,"resowner.c",766,"Stack trace:
> 1 0x8ce438 postgres errstart (elog.c:492)
> 2 0x8d01bb postgres elog_finish (elog.c:1443)
> 3 0x4ca5f4 postgres relation_close (heapam.c:1267)
> 4 0x5e7498 postgres analyzeStmt (analyze.c:728)
> 5 0x5e8a97 postgres analyzeStatement (analyze.c:274)
> 6 0x65c34c postgres vacuum (vacuum.c:319)
> 7 0x7f6172 postgres ProcessUtility (utility.c:1472)
> 8 0x7f1c3e postgres  (pquery.c:1974)
> 9 0x7f341e postgres  (pquery.c:2078)
> 10 0x7f5185 postgres PortalRun (pquery.c:1599)
> 11 0x7ee1f8 postgres PostgresMain (postgres.c:2782)
> 12 0x7a04f0 postgres  (postmaster.c:5486)
> 13 0x7a32b9 postgres PostmasterMain (postmaster.c:1459)
> 14 0x4a52b9 postgres main (main.c:226)
> 15 0x7fcaee7ded5d libc.so.6 __libc_start_main (??:0)
> 16 0x4a5339 postgres  (??:0)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HAWQ-1438) Analyze report error: relcache reference xxx is not owned by resource owner TopTransaction

2017-04-20 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976843#comment-15976843
 ] 

Ming LI commented on HAWQ-1438:
---

The PR introduction:
(1) Add TopResourceOwner to track resources used beyond the transaction boundary.
(2) When is a new TopResourceOwner automatically allocated for each process?
A singleton TopResourceOwner is generated when CurrentResourceOwner is set to 
NULL.
(3) If the crash occurs in code beyond the transaction boundary, we should 
manually create a new resource owner, because we do manually create a new one 
for each process.
(4) When is ResourceOwnerDelete(TopResourceOwner) called automatically for each 
process?
It is registered at on_proc_exit().

> Analyze report error: relcache reference xxx is not owned by resource owner 
> TopTransaction
> --
>
> Key: HAWQ-1438
> URL: https://issues.apache.org/jira/browse/HAWQ-1438
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ed Espino
> Fix For: backlog
>
>
> 2017-04-12 14:23:13.866064 
> BST,"mis_ig","ig",p124811,th-224249568,"10.33.188.8","5172",2017-04-12 
> 14:20:42 
> BST,76687174,con61,cmd16,seg-1,,,x76687174,sx1,"ERROR","XX000","relcache 
> reference e_event_1_0_102_1_prt_2 is not owned by resource owner 
> TopTransaction (resowner.c:766)",,"ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102",0,,"resowner.c",766,"Stack trace:
> 1 0x8ce438 postgres errstart (elog.c:492)
> 2 0x8d01bb postgres elog_finish (elog.c:1443)
> 3 0x4ca5f4 postgres relation_close (heapam.c:1267)
> 4 0x5e7498 postgres analyzeStmt (analyze.c:728)
> 5 0x5e8a97 postgres analyzeStatement (analyze.c:274)
> 6 0x65c34c postgres vacuum (vacuum.c:319)
> 7 0x7f6172 postgres ProcessUtility (utility.c:1472)
> 8 0x7f1c3e postgres  (pquery.c:1974)
> 9 0x7f341e postgres  (pquery.c:2078)
> 10 0x7f5185 postgres PortalRun (pquery.c:1599)
> 11 0x7ee1f8 postgres PostgresMain (postgres.c:2782)
> 12 0x7a04f0 postgres  (postmaster.c:5486)
> 13 0x7a32b9 postgres PostmasterMain (postmaster.c:1459)
> 14 0x4a52b9 postgres main (main.c:226)
> 15 0x7fcaee7ded5d libc.so.6 __libc_start_main (??:0)
> 16 0x4a5339 postgres  (??:0)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HAWQ-1438) Analyze report error: relcache reference xxx is not owned by resource owner TopTransaction

2017-04-20 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976335#comment-15976335
 ] 

Ming LI commented on HAWQ-1438:
---

This defect is caused by the fix for HAWQ-1417.

Now the resource owner cannot be used beyond the transaction boundary.

> Analyze report error: relcache reference xxx is not owned by resource owner 
> TopTransaction
> --
>
> Key: HAWQ-1438
> URL: https://issues.apache.org/jira/browse/HAWQ-1438
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ed Espino
> Fix For: backlog
>
>
> 2017-04-12 14:23:13.866064 
> BST,"mis_ig","ig",p124811,th-224249568,"10.33.188.8","5172",2017-04-12 
> 14:20:42 
> BST,76687174,con61,cmd16,seg-1,,,x76687174,sx1,"ERROR","XX000","relcache 
> reference e_event_1_0_102_1_prt_2 is not owned by resource owner 
> TopTransaction (resowner.c:766)",,"ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102",0,,"resowner.c",766,"Stack trace:
> 1 0x8ce438 postgres errstart (elog.c:492)
> 2 0x8d01bb postgres elog_finish (elog.c:1443)
> 3 0x4ca5f4 postgres relation_close (heapam.c:1267)
> 4 0x5e7498 postgres analyzeStmt (analyze.c:728)
> 5 0x5e8a97 postgres analyzeStatement (analyze.c:274)
> 6 0x65c34c postgres vacuum (vacuum.c:319)
> 7 0x7f6172 postgres ProcessUtility (utility.c:1472)
> 8 0x7f1c3e postgres  (pquery.c:1974)
> 9 0x7f341e postgres  (pquery.c:2078)
> 10 0x7f5185 postgres PortalRun (pquery.c:1599)
> 11 0x7ee1f8 postgres PostgresMain (postgres.c:2782)
> 12 0x7a04f0 postgres  (postmaster.c:5486)
> 13 0x7a32b9 postgres PostmasterMain (postmaster.c:1459)
> 14 0x4a52b9 postgres main (main.c:226)
> 15 0x7fcaee7ded5d libc.so.6 __libc_start_main (??:0)
> 16 0x4a5339 postgres  (??:0)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HAWQ-1438) Analyze report error: relcache reference xxx is not owned by resource owner TopTransaction

2017-04-20 Thread Ming LI (JIRA)
Ming LI created HAWQ-1438:
-

 Summary: Analyze report error: relcache reference xxx is not owned 
by resource owner TopTransaction
 Key: HAWQ-1438
 URL: https://issues.apache.org/jira/browse/HAWQ-1438
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Core
Reporter: Ming LI
Assignee: Ed Espino
 Fix For: backlog


2017-04-12 14:23:13.866064 
BST,"mis_ig","ig",p124811,th-224249568,"10.33.188.8","5172",2017-04-12 14:20:42 
BST,76687174,con61,cmd16,seg-1,,,x76687174,sx1,"ERROR","XX000","relcache 
reference e_event_1_0_102_1_prt_2 is not owned by resource owner TopTransaction 
(resowner.c:766)",,"ANALYZE 
mis_data_ig_account_details.e_event_1_0_102",0,,"resowner.c",766,"Stack trace:
1 0x8ce438 postgres errstart (elog.c:492)
2 0x8d01bb postgres elog_finish (elog.c:1443)
3 0x4ca5f4 postgres relation_close (heapam.c:1267)
4 0x5e7498 postgres analyzeStmt (analyze.c:728)
5 0x5e8a97 postgres analyzeStatement (analyze.c:274)
6 0x65c34c postgres vacuum (vacuum.c:319)
7 0x7f6172 postgres ProcessUtility (utility.c:1472)
8 0x7f1c3e postgres  (pquery.c:1974)
9 0x7f341e postgres  (pquery.c:2078)
10 0x7f5185 postgres PortalRun (pquery.c:1599)
11 0x7ee1f8 postgres PostgresMain (postgres.c:2782)
12 0x7a04f0 postgres  (postmaster.c:5486)
13 0x7a32b9 postgres PostmasterMain (postmaster.c:1459)
14 0x4a52b9 postgres main (main.c:226)
15 0x7fcaee7ded5d libc.so.6 __libc_start_main (??:0)
16 0x4a5339 postgres  (??:0)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HAWQ-1423) Build error when make unittest-check on MacOS

2017-04-05 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI updated HAWQ-1423:
--
Description: 
This bug only exists on MacOS.

Reproduce Steps: 
{code}
1. ./configure 
2. make -j8 
3. cd src/backend
4. make unittest-check
{code}

Build log:
{code}
../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.c:174:2: error: 
void function 'report_commerror'
  should not return a value [-Wreturn-type]
return (__MAYBE_UNUSED) mock();
^  ~~~
1 error generated.
make[4]: *** [../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.o] 
Error 1
make[3]: *** [mockup-phony] Error 2
make[2]: *** [unittest-check] Error 2
make[1]: *** [unittest-check] Error 2
make: *** [unittest-check] Error 2
{code}

  was:
This bug only exists on MacOS.

Reproduce Steps: 
{code}
1. ./configure 
2. make -j8 
3. cd src/backend
4. make unittest-check
{code}

{code}
../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.c:174:2: error: 
void function 'report_commerror'
  should not return a value [-Wreturn-type]
return (__MAYBE_UNUSED) mock();
^  ~~~
1 error generated.
make[4]: *** [../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.o] 
Error 1
make[3]: *** [mockup-phony] Error 2
make[2]: *** [unittest-check] Error 2
make[1]: *** [unittest-check] Error 2
make: *** [unittest-check] Error 2
{code}


> Build error when make unittest-check on MacOS
> -
>
> Key: HAWQ-1423
> URL: https://issues.apache.org/jira/browse/HAWQ-1423
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Build
>Reporter: Ming LI
>Assignee: Ed Espino
> Fix For: backlog
>
>
> This bug only exists on MacOS.
> Reproduce Steps: 
> {code}
> 1. ./configure 
> 2. make -j8 
> 3. cd src/backend
> 4. make unittest-check
> {code}
> Build log:
> {code}
> ../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.c:174:2: 
> error: void function 'report_commerror'
>   should not return a value [-Wreturn-type]
> return (__MAYBE_UNUSED) mock();
> ^  ~~~
> 1 error generated.
> make[4]: *** 
> [../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.o] Error 1
> make[3]: *** [mockup-phony] Error 2
> make[2]: *** [unittest-check] Error 2
> make[1]: *** [unittest-check] Error 2
> make: *** [unittest-check] Error 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HAWQ-1423) Build error when make unittest-check on MacOS

2017-04-05 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI updated HAWQ-1423:
--
Description: 
This bug only exists on MacOS.

Reproduce Steps: 
{code}
1. ./configure 
2. make -j8 
3. cd src/backend
4. make unittest-check
{code}

{code}
../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.c:174:2: error: 
void function 'report_commerror'
  should not return a value [-Wreturn-type]
return (__MAYBE_UNUSED) mock();
^  ~~~
1 error generated.
make[4]: *** [../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.o] 
Error 1
make[3]: *** [mockup-phony] Error 2
make[2]: *** [unittest-check] Error 2
make[1]: *** [unittest-check] Error 2
make: *** [unittest-check] Error 2
{code}

  was:
This bug only exists on MacOS.

Reproduce Steps: 
1. ./configure 
2. make -j8 
3. cd src/backend
4. make unittest-check

../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.c:174:2: error: 
void function 'report_commerror'
  should not return a value [-Wreturn-type]
return (__MAYBE_UNUSED) mock();
^  ~~~
1 error generated.
make[4]: *** [../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.o] 
Error 1
make[3]: *** [mockup-phony] Error 2
make[2]: *** [unittest-check] Error 2
make[1]: *** [unittest-check] Error 2
make: *** [unittest-check] Error 2


> Build error when make unittest-check on MacOS
> -
>
> Key: HAWQ-1423
> URL: https://issues.apache.org/jira/browse/HAWQ-1423
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Build
>Reporter: Ming LI
>Assignee: Ed Espino
> Fix For: backlog
>
>
> This bug only exists on MacOS.
> Reproduce Steps: 
> {code}
> 1. ./configure 
> 2. make -j8 
> 3. cd src/backend
> 4. make unittest-check
> {code}
> {code}
> ../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.c:174:2: 
> error: void function 'report_commerror'
>   should not return a value [-Wreturn-type]
> return (__MAYBE_UNUSED) mock();
> ^  ~~~
> 1 error generated.
> make[4]: *** 
> [../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.o] Error 1
> make[3]: *** [mockup-phony] Error 2
> make[2]: *** [unittest-check] Error 2
> make[1]: *** [unittest-check] Error 2
> make: *** [unittest-check] Error 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HAWQ-1423) Build error when make unittest-check on MacOS

2017-04-05 Thread Ming LI (JIRA)
Ming LI created HAWQ-1423:
-

 Summary: Build error when make unittest-check on MacOS
 Key: HAWQ-1423
 URL: https://issues.apache.org/jira/browse/HAWQ-1423
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Build
Reporter: Ming LI
Assignee: Ed Espino
 Fix For: backlog


reproduce: 
1. ./configure 
2. make -j8 
3. cd src/backend
4. make unittest-check

../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.c:174:2: error: 
void function 'report_commerror'
  should not return a value [-Wreturn-type]
return (__MAYBE_UNUSED) mock();
^  ~~~
1 error generated.
make[4]: *** [../../../../../src/test/unit/mock/backend/libpq/be-secure_mock.o] 
Error 1
make[3]: *** [mockup-phony] Error 2
make[2]: *** [unittest-check] Error 2
make[1]: *** [unittest-check] Error 2
make: *** [unittest-check] Error 2
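
For context, a hedged illustration of the error and the kind of change that 
resolves it; the function below is a made-up stand-in for the generated 
cmockery mock, not the actual be-secure_mock.c content.
{code}
#include <stdarg.h>
#include <stddef.h>
#include <setjmp.h>
#include "cmockery.h"

/* Sketch only: clang on macOS rejects returning a value from a void function
 * (reported as an error under -Wreturn-type), so a generated mock for a void
 * function must discard the mock() result instead of returning it. */
void
report_commerror_mock_sketch(void)      /* hypothetical stand-in */
{
    /* Rejected:  return (__MAYBE_UNUSED) mock();
     * Accepted:  evaluate the cmockery expectation and return nothing. */
    (void) mock();
}
{code}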



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (HAWQ-1417) Crashed at ANALYZE after COPY

2017-03-28 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944722#comment-15944722
 ] 

Ming LI edited comment on HAWQ-1417 at 3/28/17 8:02 AM:


The root cause: when ANALYZE runs, it commits the previous transaction and 
begins one transaction for each relation to be analyzed. However, some code was 
added that runs before the transactions begin, which means that at that point 
there is no transaction and no resource owner either.

The solution: create one resource owner for analyzeStmt(), and delete it after 
processing finishes.
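
A hedged C sketch of that fix, using the stock resource-owner APIs; the 
variable names and exact release sequence are illustrative, not the literal 
HAWQ-1417 patch.
{code}
/* Sketch only: give analyzeStmt() an owner for the window in which no
 * transaction (and hence no CurrentResourceOwner) exists. */
ResourceOwner analyzeOwner = ResourceOwnerCreate(NULL, "analyzeStmt");
ResourceOwner saveOwner    = CurrentResourceOwner;

CurrentResourceOwner = analyzeOwner;

/* ... commit the previous transaction, then begin and commit one
 *     transaction per relation being analyzed ... */

/* Tear the owner down once the statement is done. */
CurrentResourceOwner = saveOwner;
ResourceOwnerRelease(analyzeOwner, RESOURCE_RELEASE_BEFORE_LOCKS, false, true);
ResourceOwnerRelease(analyzeOwner, RESOURCE_RELEASE_LOCKS, false, true);
ResourceOwnerRelease(analyzeOwner, RESOURCE_RELEASE_AFTER_LOCKS, false, true);
ResourceOwnerDelete(analyzeOwner);
{code}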


was (Author: mli):
The root cause is:  when ANALYZE is running, it will commit previous 
transaction, and begin one transaction for each relation to process analyze. 
However some code added before begining transactions, which means in this time, 
no transaction, and no resource owner either.

The solution is to: create one resource owner for analyzeStmt(), and delete it 
after processing it.

> Crashed at ANALYZE after COPY
> -
>
> Key: HAWQ-1417
> URL: https://issues.apache.org/jira/browse/HAWQ-1417
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> This is the line in the master log where the PANIC is reported:
> {code}
> (gdb) bt
> #0  0x7f6d35b0e6ab in raise () from 
> /data/logs/52280/new_panic/packcore-core.postgres.457052/lib64/libpthread.so.0
> #1  0x008c7d79 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, 
> processName=) at elog.c:4519
> #2  
> #3  ResourceOwnerEnlargeRelationRefs (owner=0x0) at resowner.c:708
> #4  0x008b5659 in RelationIncrementReferenceCount (rel=0x1baf500) at 
> relcache.c:1941
> #5  RelationIdGetRelation (relationId=relationId@entry=1259) at 
> relcache.c:1895
> #6  0x004ca664 in relation_open (lockmode=lockmode@entry=1, 
> relationId=relationId@entry=1259) at heapam.c:882
> #7  heap_open (relationId=relationId@entry=1259, lockmode=lockmode@entry=1) 
> at heapam.c:1285
> #8  0x008b0945 in ScanPgRelation (targetRelId=targetRelId@entry=5010, 
> indexOK=indexOK@entry=1 '\001', 
> pg_class_relation=pg_class_relation@entry=0x7ffdf2aed390) at relcache.c:279
> #9  0x008b4302 in RelationBuildDesc (targetRelId=5010, 
> insertIt=) at relcache.c:1209
> #10 0x008b56c7 in RelationIdGetRelation 
> (relationId=relationId@entry=5010) at relcache.c:1918
> #11 0x004ca664 in relation_open (lockmode=, 
> relationId=5010) at heapam.c:882
> #12 heap_open (relationId=5010, lockmode=) at heapam.c:1285
> #13 0x0055d1e6 in caql_basic_fn_all (pcql=0x1d70a58, 
> bLockEntireTable=0 '\000', pCtx=0x7ffdf2aed480, pchn=0xf4b328 
> ) at caqlanalyze.c:343
> #14 caql_switch (pchn=pchn@entry=0xf4b328 , 
> pCtx=pCtx@entry=0x7ffdf2aed480, pcql=pcql@entry=0x1d70a58) at 
> caqlanalyze.c:229
> #15 0x005636db in caql_getcount (pCtx0=pCtx0@entry=0x0, 
> pcql=0x1d70a58) at caqlaccess.c:367
> #16 0x009ddc47 in rel_is_partitioned (relid=1882211) at 
> cdbpartition.c:232
> #17 rel_part_status (relid=relid@entry=1882211) at cdbpartition.c:484
> #18 0x005e7d43 in calculate_virtual_segment_number 
> (candidateOids=) at analyze.c:833
> #19 analyzeStmt (stmt=stmt@entry=0x2045dd0, relids=relids@entry=0x0, 
> preferred_seg_num=preferred_seg_num@entry=-1) at analyze.c:486
> #20 0x005e89a7 in analyzeStatement (stmt=stmt@entry=0x2045dd0, 
> relids=relids@entry=0x0, preferred_seg_num=preferred_seg_num@entry=-1) at 
> analyze.c:271
> #21 0x0065c25c in vacuum (vacstmt=vacstmt@entry=0x2045bf0, 
> relids=relids@entry=0x0, preferred_seg_num=preferred_seg_num@entry=-1) at 
> vacuum.c:316
> #22 0x007f6012 in ProcessUtility 
> (parsetree=parsetree@entry=0x2045bf0, queryString=0x2045d30 "ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102", params=0x0, 
> isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0xf04ba0 ,
> completionTag=completionTag@entry=0x7ffdf2aee3f0 "") at utility.c:1471
> #23 0x007f1ade in PortalRunUtility (portal=portal@entry=0x1bfb490, 
> utilityStmt=utilityStmt@entry=0x2045bf0, isTopLevel=isTopLevel@entry=1 
> '\001', dest=dest@entry=0xf04ba0 , 
> completionTag=completionTag@entry=0x7ffdf2aee3f0 "") at pquery.c:1968
> #24 0x007f32be in PortalRunMulti (portal=portal@entry=0x1bfb490, 
> isTopLevel=isTopLevel@entry=1 '\001', dest=0xf04ba0 , 
> dest@entry=0x1b93f60, altdest=0xf04ba0 , 
> altdest@entry=0x1b93f60, completionTag=completionTag@entry=0x7ffdf2aee3f0 "")
> at pquery.c:2078
> #25 0x007f5025 in PortalRun (portal=portal@entry=0x1bfb490, 
> count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x1b93f60, altdest=altdest@entry=0x1b93f60, 
> 

[jira] [Commented] (HAWQ-1417) Crashed at ANALYZE after COPY

2017-03-28 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944722#comment-15944722
 ] 

Ming LI commented on HAWQ-1417:
---

The root cause: while ANALYZE is running, it commits the previous transaction 
and begins a new transaction for each relation to be analyzed. However, some 
code runs before the new transaction has begun, and at that point there is no 
active transaction and no resource owner either.

The solution: create a dedicated resource owner for analyzeStmt() and delete it 
once processing is finished (a sketch of the approach follows below).
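
A minimal sketch of that approach, using the standard PostgreSQL resource 
owner API (this illustrates the idea only and is not the actual HAWQ patch; 
the function name and the placement of the work are assumptions):

{code}
#include "postgres.h"
#include "utils/resowner.h"

/*
 * Sketch: give the ANALYZE driver its own resource owner, so that relcache
 * references taken between the per-relation transactions still have an owner.
 */
static void
analyze_with_own_resowner(void)
{
    ResourceOwner saved = CurrentResourceOwner;
    ResourceOwner owner = ResourceOwnerCreate(NULL, "analyzeStmt");

    CurrentResourceOwner = owner;

    /* ... per-relation ANALYZE work that may touch the relcache before the
     *     next transaction has started goes here ... */

    /* Tear the owner down once processing is finished. */
    CurrentResourceOwner = saved;
    ResourceOwnerRelease(owner, RESOURCE_RELEASE_BEFORE_LOCKS, false, false);
    ResourceOwnerRelease(owner, RESOURCE_RELEASE_LOCKS, false, false);
    ResourceOwnerRelease(owner, RESOURCE_RELEASE_AFTER_LOCKS, false, false);
    ResourceOwnerDelete(owner);
}
{code}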

> Crashed at ANALYZE after COPY
> -
>
> Key: HAWQ-1417
> URL: https://issues.apache.org/jira/browse/HAWQ-1417
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> This is the line in the master log where the PANIC is reported:
> {code}
> (gdb) bt
> #0  0x7f6d35b0e6ab in raise () from 
> /data/logs/52280/new_panic/packcore-core.postgres.457052/lib64/libpthread.so.0
> #1  0x008c7d79 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, 
> processName=) at elog.c:4519
> #2  
> #3  ResourceOwnerEnlargeRelationRefs (owner=0x0) at resowner.c:708
> #4  0x008b5659 in RelationIncrementReferenceCount (rel=0x1baf500) at 
> relcache.c:1941
> #5  RelationIdGetRelation (relationId=relationId@entry=1259) at 
> relcache.c:1895
> #6  0x004ca664 in relation_open (lockmode=lockmode@entry=1, 
> relationId=relationId@entry=1259) at heapam.c:882
> #7  heap_open (relationId=relationId@entry=1259, lockmode=lockmode@entry=1) 
> at heapam.c:1285
> #8  0x008b0945 in ScanPgRelation (targetRelId=targetRelId@entry=5010, 
> indexOK=indexOK@entry=1 '\001', 
> pg_class_relation=pg_class_relation@entry=0x7ffdf2aed390) at relcache.c:279
> #9  0x008b4302 in RelationBuildDesc (targetRelId=5010, 
> insertIt=) at relcache.c:1209
> #10 0x008b56c7 in RelationIdGetRelation 
> (relationId=relationId@entry=5010) at relcache.c:1918
> #11 0x004ca664 in relation_open (lockmode=, 
> relationId=5010) at heapam.c:882
> #12 heap_open (relationId=5010, lockmode=) at heapam.c:1285
> #13 0x0055d1e6 in caql_basic_fn_all (pcql=0x1d70a58, 
> bLockEntireTable=0 '\000', pCtx=0x7ffdf2aed480, pchn=0xf4b328 
> ) at caqlanalyze.c:343
> #14 caql_switch (pchn=pchn@entry=0xf4b328 , 
> pCtx=pCtx@entry=0x7ffdf2aed480, pcql=pcql@entry=0x1d70a58) at 
> caqlanalyze.c:229
> #15 0x005636db in caql_getcount (pCtx0=pCtx0@entry=0x0, 
> pcql=0x1d70a58) at caqlaccess.c:367
> #16 0x009ddc47 in rel_is_partitioned (relid=1882211) at 
> cdbpartition.c:232
> #17 rel_part_status (relid=relid@entry=1882211) at cdbpartition.c:484
> #18 0x005e7d43 in calculate_virtual_segment_number 
> (candidateOids=) at analyze.c:833
> #19 analyzeStmt (stmt=stmt@entry=0x2045dd0, relids=relids@entry=0x0, 
> preferred_seg_num=preferred_seg_num@entry=-1) at analyze.c:486
> #20 0x005e89a7 in analyzeStatement (stmt=stmt@entry=0x2045dd0, 
> relids=relids@entry=0x0, preferred_seg_num=preferred_seg_num@entry=-1) at 
> analyze.c:271
> #21 0x0065c25c in vacuum (vacstmt=vacstmt@entry=0x2045bf0, 
> relids=relids@entry=0x0, preferred_seg_num=preferred_seg_num@entry=-1) at 
> vacuum.c:316
> #22 0x007f6012 in ProcessUtility 
> (parsetree=parsetree@entry=0x2045bf0, queryString=0x2045d30 "ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102", params=0x0, 
> isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0xf04ba0 ,
> completionTag=completionTag@entry=0x7ffdf2aee3f0 "") at utility.c:1471
> #23 0x007f1ade in PortalRunUtility (portal=portal@entry=0x1bfb490, 
> utilityStmt=utilityStmt@entry=0x2045bf0, isTopLevel=isTopLevel@entry=1 
> '\001', dest=dest@entry=0xf04ba0 , 
> completionTag=completionTag@entry=0x7ffdf2aee3f0 "") at pquery.c:1968
> #24 0x007f32be in PortalRunMulti (portal=portal@entry=0x1bfb490, 
> isTopLevel=isTopLevel@entry=1 '\001', dest=0xf04ba0 , 
> dest@entry=0x1b93f60, altdest=0xf04ba0 , 
> altdest@entry=0x1b93f60, completionTag=completionTag@entry=0x7ffdf2aee3f0 "")
> at pquery.c:2078
> #25 0x007f5025 in PortalRun (portal=portal@entry=0x1bfb490, 
> count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x1b93f60, altdest=altdest@entry=0x1b93f60, 
> completionTag=completionTag@entry=0x7ffdf2aee3f0 "") at pquery.c:1595
> #26 0x007ee098 in exec_execute_message (max_rows=9223372036854775807, 
> portal_name=0x1b93ad0 "") at postgres.c:2782
> #27 PostgresMain (argc=, argv=, 
> argv@entry=0x1a49b40, username=0x1a498f0 "mis_ig") at postgres.c:5170
> #28 0x007a0390 in BackendRun (port=0x1a185f0) at postmaster.c:5915
> #29 BackendStartup (port=0x1a185f0) at postmaster.c:5484
> #30 ServerLoop 

[jira] [Assigned] (HAWQ-1417) Crashed at ANALYZE after COPY

2017-03-28 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1417:
-

Assignee: Ming LI  (was: Ed Espino)

> Crashed at ANALYZE after COPY
> -
>
> Key: HAWQ-1417
> URL: https://issues.apache.org/jira/browse/HAWQ-1417
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> This is the line in the master log where the PANIC is reported:
> {code}
> (gdb) bt
> #0  0x7f6d35b0e6ab in raise () from 
> /data/logs/52280/new_panic/packcore-core.postgres.457052/lib64/libpthread.so.0
> #1  0x008c7d79 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, 
> processName=) at elog.c:4519
> #2  
> #3  ResourceOwnerEnlargeRelationRefs (owner=0x0) at resowner.c:708
> #4  0x008b5659 in RelationIncrementReferenceCount (rel=0x1baf500) at 
> relcache.c:1941
> #5  RelationIdGetRelation (relationId=relationId@entry=1259) at 
> relcache.c:1895
> #6  0x004ca664 in relation_open (lockmode=lockmode@entry=1, 
> relationId=relationId@entry=1259) at heapam.c:882
> #7  heap_open (relationId=relationId@entry=1259, lockmode=lockmode@entry=1) 
> at heapam.c:1285
> #8  0x008b0945 in ScanPgRelation (targetRelId=targetRelId@entry=5010, 
> indexOK=indexOK@entry=1 '\001', 
> pg_class_relation=pg_class_relation@entry=0x7ffdf2aed390) at relcache.c:279
> #9  0x008b4302 in RelationBuildDesc (targetRelId=5010, 
> insertIt=) at relcache.c:1209
> #10 0x008b56c7 in RelationIdGetRelation 
> (relationId=relationId@entry=5010) at relcache.c:1918
> #11 0x004ca664 in relation_open (lockmode=, 
> relationId=5010) at heapam.c:882
> #12 heap_open (relationId=5010, lockmode=) at heapam.c:1285
> #13 0x0055d1e6 in caql_basic_fn_all (pcql=0x1d70a58, 
> bLockEntireTable=0 '\000', pCtx=0x7ffdf2aed480, pchn=0xf4b328 
> ) at caqlanalyze.c:343
> #14 caql_switch (pchn=pchn@entry=0xf4b328 , 
> pCtx=pCtx@entry=0x7ffdf2aed480, pcql=pcql@entry=0x1d70a58) at 
> caqlanalyze.c:229
> #15 0x005636db in caql_getcount (pCtx0=pCtx0@entry=0x0, 
> pcql=0x1d70a58) at caqlaccess.c:367
> #16 0x009ddc47 in rel_is_partitioned (relid=1882211) at 
> cdbpartition.c:232
> #17 rel_part_status (relid=relid@entry=1882211) at cdbpartition.c:484
> #18 0x005e7d43 in calculate_virtual_segment_number 
> (candidateOids=) at analyze.c:833
> #19 analyzeStmt (stmt=stmt@entry=0x2045dd0, relids=relids@entry=0x0, 
> preferred_seg_num=preferred_seg_num@entry=-1) at analyze.c:486
> #20 0x005e89a7 in analyzeStatement (stmt=stmt@entry=0x2045dd0, 
> relids=relids@entry=0x0, preferred_seg_num=preferred_seg_num@entry=-1) at 
> analyze.c:271
> #21 0x0065c25c in vacuum (vacstmt=vacstmt@entry=0x2045bf0, 
> relids=relids@entry=0x0, preferred_seg_num=preferred_seg_num@entry=-1) at 
> vacuum.c:316
> #22 0x007f6012 in ProcessUtility 
> (parsetree=parsetree@entry=0x2045bf0, queryString=0x2045d30 "ANALYZE 
> mis_data_ig_account_details.e_event_1_0_102", params=0x0, 
> isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0xf04ba0 ,
> completionTag=completionTag@entry=0x7ffdf2aee3f0 "") at utility.c:1471
> #23 0x007f1ade in PortalRunUtility (portal=portal@entry=0x1bfb490, 
> utilityStmt=utilityStmt@entry=0x2045bf0, isTopLevel=isTopLevel@entry=1 
> '\001', dest=dest@entry=0xf04ba0 , 
> completionTag=completionTag@entry=0x7ffdf2aee3f0 "") at pquery.c:1968
> #24 0x007f32be in PortalRunMulti (portal=portal@entry=0x1bfb490, 
> isTopLevel=isTopLevel@entry=1 '\001', dest=0xf04ba0 , 
> dest@entry=0x1b93f60, altdest=0xf04ba0 , 
> altdest@entry=0x1b93f60, completionTag=completionTag@entry=0x7ffdf2aee3f0 "")
> at pquery.c:2078
> #25 0x007f5025 in PortalRun (portal=portal@entry=0x1bfb490, 
> count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x1b93f60, altdest=altdest@entry=0x1b93f60, 
> completionTag=completionTag@entry=0x7ffdf2aee3f0 "") at pquery.c:1595
> #26 0x007ee098 in exec_execute_message (max_rows=9223372036854775807, 
> portal_name=0x1b93ad0 "") at postgres.c:2782
> #27 PostgresMain (argc=, argv=, 
> argv@entry=0x1a49b40, username=0x1a498f0 "mis_ig") at postgres.c:5170
> #28 0x007a0390 in BackendRun (port=0x1a185f0) at postmaster.c:5915
> #29 BackendStartup (port=0x1a185f0) at postmaster.c:5484
> #30 ServerLoop () at postmaster.c:2163
> #31 0x007a3159 in PostmasterMain (argc=, 
> argv=) at postmaster.c:1454
> #32 0x004a52b9 in main (argc=9, argv=0x1a20d10) at main.c:226
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HAWQ-1417) Crashed at ANALYZE after COPY

2017-03-28 Thread Ming LI (JIRA)
Ming LI created HAWQ-1417:
-

 Summary: Crashed at ANALYZE after COPY
 Key: HAWQ-1417
 URL: https://issues.apache.org/jira/browse/HAWQ-1417
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Core
Reporter: Ming LI
Assignee: Ed Espino
 Fix For: backlog


This is the line in the master log where the PANIC is reported:
{code}
(gdb) bt
#0  0x7f6d35b0e6ab in raise () from 
/data/logs/52280/new_panic/packcore-core.postgres.457052/lib64/libpthread.so.0
#1  0x008c7d79 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, 
processName=) at elog.c:4519
#2  
#3  ResourceOwnerEnlargeRelationRefs (owner=0x0) at resowner.c:708
#4  0x008b5659 in RelationIncrementReferenceCount (rel=0x1baf500) at 
relcache.c:1941
#5  RelationIdGetRelation (relationId=relationId@entry=1259) at relcache.c:1895
#6  0x004ca664 in relation_open (lockmode=lockmode@entry=1, 
relationId=relationId@entry=1259) at heapam.c:882
#7  heap_open (relationId=relationId@entry=1259, lockmode=lockmode@entry=1) at 
heapam.c:1285
#8  0x008b0945 in ScanPgRelation (targetRelId=targetRelId@entry=5010, 
indexOK=indexOK@entry=1 '\001', 
pg_class_relation=pg_class_relation@entry=0x7ffdf2aed390) at relcache.c:279
#9  0x008b4302 in RelationBuildDesc (targetRelId=5010, 
insertIt=) at relcache.c:1209
#10 0x008b56c7 in RelationIdGetRelation 
(relationId=relationId@entry=5010) at relcache.c:1918
#11 0x004ca664 in relation_open (lockmode=, 
relationId=5010) at heapam.c:882
#12 heap_open (relationId=5010, lockmode=) at heapam.c:1285
#13 0x0055d1e6 in caql_basic_fn_all (pcql=0x1d70a58, bLockEntireTable=0 
'\000', pCtx=0x7ffdf2aed480, pchn=0xf4b328 ) at 
caqlanalyze.c:343
#14 caql_switch (pchn=pchn@entry=0xf4b328 , 
pCtx=pCtx@entry=0x7ffdf2aed480, pcql=pcql@entry=0x1d70a58) at caqlanalyze.c:229
#15 0x005636db in caql_getcount (pCtx0=pCtx0@entry=0x0, pcql=0x1d70a58) 
at caqlaccess.c:367
#16 0x009ddc47 in rel_is_partitioned (relid=1882211) at 
cdbpartition.c:232
#17 rel_part_status (relid=relid@entry=1882211) at cdbpartition.c:484
#18 0x005e7d43 in calculate_virtual_segment_number 
(candidateOids=) at analyze.c:833
#19 analyzeStmt (stmt=stmt@entry=0x2045dd0, relids=relids@entry=0x0, 
preferred_seg_num=preferred_seg_num@entry=-1) at analyze.c:486
#20 0x005e89a7 in analyzeStatement (stmt=stmt@entry=0x2045dd0, 
relids=relids@entry=0x0, preferred_seg_num=preferred_seg_num@entry=-1) at 
analyze.c:271
#21 0x0065c25c in vacuum (vacstmt=vacstmt@entry=0x2045bf0, 
relids=relids@entry=0x0, preferred_seg_num=preferred_seg_num@entry=-1) at 
vacuum.c:316
#22 0x007f6012 in ProcessUtility (parsetree=parsetree@entry=0x2045bf0, 
queryString=0x2045d30 "ANALYZE mis_data_ig_account_details.e_event_1_0_102", 
params=0x0, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0xf04ba0 
,
completionTag=completionTag@entry=0x7ffdf2aee3f0 "") at utility.c:1471
#23 0x007f1ade in PortalRunUtility (portal=portal@entry=0x1bfb490, 
utilityStmt=utilityStmt@entry=0x2045bf0, isTopLevel=isTopLevel@entry=1 '\001', 
dest=dest@entry=0xf04ba0 , 
completionTag=completionTag@entry=0x7ffdf2aee3f0 "") at pquery.c:1968
#24 0x007f32be in PortalRunMulti (portal=portal@entry=0x1bfb490, 
isTopLevel=isTopLevel@entry=1 '\001', dest=0xf04ba0 , 
dest@entry=0x1b93f60, altdest=0xf04ba0 , altdest@entry=0x1b93f60, 
completionTag=completionTag@entry=0x7ffdf2aee3f0 "")
at pquery.c:2078
#25 0x007f5025 in PortalRun (portal=portal@entry=0x1bfb490, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
dest=dest@entry=0x1b93f60, altdest=altdest@entry=0x1b93f60, 
completionTag=completionTag@entry=0x7ffdf2aee3f0 "") at pquery.c:1595
#26 0x007ee098 in exec_execute_message (max_rows=9223372036854775807, 
portal_name=0x1b93ad0 "") at postgres.c:2782
#27 PostgresMain (argc=, argv=, 
argv@entry=0x1a49b40, username=0x1a498f0 "mis_ig") at postgres.c:5170
#28 0x007a0390 in BackendRun (port=0x1a185f0) at postmaster.c:5915
#29 BackendStartup (port=0x1a185f0) at postmaster.c:5484
#30 ServerLoop () at postmaster.c:2163
#31 0x007a3159 in PostmasterMain (argc=, argv=) at postmaster.c:1454
#32 0x004a52b9 in main (argc=9, argv=0x1a20d10) at main.c:226
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (HAWQ-1408) PANICs during COPY ... FROM STDIN

2017-03-27 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1408.
---
Resolution: Fixed

> PANICs during COPY ... FROM STDIN
> -
>
> Key: HAWQ-1408
> URL: https://issues.apache.org/jira/browse/HAWQ-1408
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Affects Versions: backlog
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: 2.1.0.0-incubating
>
>
> We found PANIC (and respective core dumps). From the initial analysis from 
> the logs and core dump, the query causing this PANIC is a "COPY ... FROM 
> STDIN". This query does not always panic.
> This kind of queries are executed from Java/Scala code (by one of IG Spark 
> Jobs). Connection to the DB is managed by connection pool (commons-dbcp2) and 
> validated on borrow by “select 1” validation query. IG is using 
> postgresql-9.4-1206-jdbc41 as a java driver to create those connections. I 
> believe they should be using the driver from DataDirect, available in PivNet; 
> however, I haven't found hard evidence pointing the driver as a root cause.
> My initial analysis on the packcore for the master PANIC. Not sure if this 
> helps or makes sense.
> This is the backtrace of the packcore for process 466858:
> {code}
> (gdb) bt
> #0  0x7fd875f906ab in raise () from 
> /data/logs/52280/packcore-core.postgres.466858/lib64/libpthread.so.0
> #1  0x008c0b19 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, 
> processName=) at elog.c:4519
> #2  
> #3  0x0053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, 
> existing_segnos@entry=0x0, relid=relid@entry=1195061, 
> segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', 
> keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
> #4  0x0053c08f in assignPerRelSegno 
> (all_relids=all_relids@entry=0x2b96d68, segment_num=6) at 
> appendonlywriter.c:1212
> #5  0x005f79e8 in DoCopy (stmt=stmt@entry=0x2b2a3d8, 
> queryString=) at copy.c:1591
> #6  0x007ef737 in ProcessUtility 
> (parsetree=parsetree@entry=0x2b2a3d8, queryString=0x2c2f550 "COPY 
> mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, 
> tracking_id, name, value_string, value_timestamp, value_number, 
> value_boolean, environment, account, channel, device, feat"...,
> params=0x0, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 
> "") at utility.c:1076
> #7  0x007ea95e in PortalRunUtility (portal=portal@entry=0x2b8eab0, 
> utilityStmt=utilityStmt@entry=0x2b2a3d8, isTopLevel=isTopLevel@entry=1 
> '\001', dest=dest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1969
> #8  0x007ec13e in PortalRunMulti (portal=portal@entry=0x2b8eab0, 
> isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, 
> altdest=altdest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:2079
> #9  0x007ede95 in PortalRun (portal=portal@entry=0x2b8eab0, 
> count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x2b2a7c8, altdest=altdest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1596
> #10 0x007e5ad9 in exec_simple_query 
> (query_string=query_string@entry=0x2b29100 "COPY 
> mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, 
> tracking_id, name, value_string, value_timestamp, value_number, 
> value_boolean, environment, account, channel, device, feat"...,
> seqServerHost=seqServerHost@entry=0x0, 
> seqServerPort=seqServerPort@entry=-1) at postgres.c:1816
> #11 0x007e6cb2 in PostgresMain (argc=, argv= out>, argv@entry=0x29d7820, username=0x29d75d0 "mis_ig") at postgres.c:4840
> #12 0x00799540 in BackendRun (port=0x29afc50) at postmaster.c:5915
> #13 BackendStartup (port=0x29afc50) at postmaster.c:5484
> #14 ServerLoop () at postmaster.c:2163
> #15 0x0079c309 in PostmasterMain (argc=, 
> argv=) at postmaster.c:1454
> #16 0x004a4209 in main (argc=9, argv=0x29af010) at main.c:226
> {code}
> Jumping into the frame 3 and running info locals, we found something odd for 
> "status" variable:
> {code}
> (gdb) f 3
> #3  0x0053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, 
> existing_segnos@entry=0x0, relid=relid@entry=1195061, 
> segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', 
> keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
> 1166  appendonlywriter.c: No such file or directory.
> (gdb) info locals
> status = 0x0
> [...]
> {code}
> This panic comes from this piece of code in "appendonlywritter.c":
> {code}
> for (int i = 0; i < segment_num; i++)
> {
> AOSegfileStatus *status = 

[jira] [Comment Edited] (HAWQ-1408) PANICs during COPY ... FROM STDIN

2017-03-27 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940431#comment-15940431
 ] 

Ming LI edited comment on HAWQ-1408 at 3/28/17 2:20 AM:


Since there are no reproduction steps, I could only go through the code to 
look for a possible root cause, and I have found one candidate.

The likely cause is related to 
https://issues.apache.org/jira/browse/HAWQ-642.
(1) When keepHash is set, generating only remaining_num new seg files is not 
guaranteed to be enough to allocate a seg file for every segment_num, because a 
newly generated seg file number may map to the same hash key as an existing one.
(2) When keepHash is not set, the remaining_num returned by addCandidateSegno() 
is not precise, so it needs to be fixed to satisfy HAWQ-642.
(3) Because of (1), we should keep monitoring remaining_num at the final call 
to addCandidateSegno() instead of deciding it only at the beginning. That way, 
even if a newly generated seg file turns out to be unusable by this query (for 
example because of a hash key conflict), we can continue allocating until the 
query has enough seg files.

@lilima1 @hubertzhang, please correct me if I am wrong. Thanks.


was (Author: mli):
In case of there is no reproduce steps, I only went through the code to find 
out possible root cause. Now I get one possible cause.

The possible reason is related https://issues.apache.org/jira/browse/HAWQ-642.
(1) When keepHash, we can not guarantee that only generate remaining_num is 
enough to alloc seg file for all segment_num. Because the next new generated 
seg file number may have same hash key id with the old one.
(2) When !keepHash, now the remaining_num returned from addCandidateSegno() is 
not precise. So we need to fix it to meet the need of HAWQ-642.
(3) At the final call to addCandidateSegno(), we should keep monitoring the 
remaining_num instead of at the beginning because of the reason (1). So that 
even the new seg file is not actually used by this query ( maybe for hash key 
conflict), we can continue to alloc enough seg files for this query.

@lilima1 @hubertzhang , please correct me if I am wrong. Thanks.

> PANICs during COPY ... FROM STDIN
> -
>
> Key: HAWQ-1408
> URL: https://issues.apache.org/jira/browse/HAWQ-1408
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Affects Versions: backlog
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: 2.1.0.0-incubating
>
>
> We found PANIC (and respective core dumps). From the initial analysis from 
> the logs and core dump, the query causing this PANIC is a "COPY ... FROM 
> STDIN". This query does not always panic.
> This kind of queries are executed from Java/Scala code (by one of IG Spark 
> Jobs). Connection to the DB is managed by connection pool (commons-dbcp2) and 
> validated on borrow by “select 1” validation query. IG is using 
> postgresql-9.4-1206-jdbc41 as a java driver to create those connections. I 
> believe they should be using the driver from DataDirect, available in PivNet; 
> however, I haven't found hard evidence pointing the driver as a root cause.
> My initial analysis on the packcore for the master PANIC. Not sure if this 
> helps or makes sense.
> This is the backtrace of the packcore for process 466858:
> {code}
> (gdb) bt
> #0  0x7fd875f906ab in raise () from 
> /data/logs/52280/packcore-core.postgres.466858/lib64/libpthread.so.0
> #1  0x008c0b19 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, 
> processName=) at elog.c:4519
> #2  
> #3  0x0053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, 
> existing_segnos@entry=0x0, relid=relid@entry=1195061, 
> segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', 
> keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
> #4  0x0053c08f in assignPerRelSegno 
> (all_relids=all_relids@entry=0x2b96d68, segment_num=6) at 
> appendonlywriter.c:1212
> #5  0x005f79e8 in DoCopy (stmt=stmt@entry=0x2b2a3d8, 
> queryString=) at copy.c:1591
> #6  0x007ef737 in ProcessUtility 
> (parsetree=parsetree@entry=0x2b2a3d8, queryString=0x2c2f550 "COPY 
> mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, 
> tracking_id, name, value_string, value_timestamp, value_number, 
> value_boolean, environment, account, channel, device, feat"...,
> params=0x0, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 
> "") at utility.c:1076
> #7  0x007ea95e in PortalRunUtility (portal=portal@entry=0x2b8eab0, 
> utilityStmt=utilityStmt@entry=0x2b2a3d8, isTopLevel=isTopLevel@entry=1 
> '\001', dest=dest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1969
> #8  0x007ec13e in PortalRunMulti (portal=portal@entry=0x2b8eab0, 
> 

[jira] [Created] (HAWQ-1408) PANICs during COPY ... FROM STDIN

2017-03-24 Thread Ming LI (JIRA)
Ming LI created HAWQ-1408:
-

 Summary: PANICs during COPY ... FROM STDIN
 Key: HAWQ-1408
 URL: https://issues.apache.org/jira/browse/HAWQ-1408
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Core
Reporter: Ming LI
Assignee: Ed Espino
 Fix For: 2.1.0.0-incubating


We found a PANIC (and the corresponding core dumps). From the initial analysis 
of the logs and core dump, the query causing this PANIC is a "COPY ... FROM 
STDIN". The query does not always panic.
These queries are executed from Java/Scala code (by one of the IG Spark jobs). 
The connection to the DB is managed by a connection pool (commons-dbcp2) and 
validated on borrow with a "select 1" validation query. IG is using 
postgresql-9.4-1206-jdbc41 as the Java driver to create those connections. I 
believe they should be using the driver from DataDirect, available on PivNet; 
however, I haven't found hard evidence pointing to the driver as the root cause.

My initial analysis of the packcore for the master PANIC is below. Not sure if 
this helps or makes sense.

This is the backtrace of the packcore for process 466858:

{code}
(gdb) bt
#0  0x7fd875f906ab in raise () from 
/data/logs/52280/packcore-core.postgres.466858/lib64/libpthread.so.0
#1  0x008c0b19 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, 
processName=) at elog.c:4519
#2  
#3  0x0053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, 
existing_segnos@entry=0x0, relid=relid@entry=1195061, 
segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', 
keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
#4  0x0053c08f in assignPerRelSegno 
(all_relids=all_relids@entry=0x2b96d68, segment_num=6) at 
appendonlywriter.c:1212
#5  0x005f79e8 in DoCopy (stmt=stmt@entry=0x2b2a3d8, 
queryString=) at copy.c:1591
#6  0x007ef737 in ProcessUtility (parsetree=parsetree@entry=0x2b2a3d8, 
queryString=0x2c2f550 "COPY 
mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, 
tracking_id, name, value_string, value_timestamp, value_number, value_boolean, 
environment, account, channel, device, feat"...,
params=0x0, isTopLevel=isTopLevel@entry=1 '\001', 
dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 "") 
at utility.c:1076
#7  0x007ea95e in PortalRunUtility (portal=portal@entry=0x2b8eab0, 
utilityStmt=utilityStmt@entry=0x2b2a3d8, isTopLevel=isTopLevel@entry=1 '\001', 
dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 "") 
at pquery.c:1969
#8  0x007ec13e in PortalRunMulti (portal=portal@entry=0x2b8eab0, 
isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, 
altdest=altdest@entry=0x2b2a7c8, 
completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:2079
#9  0x007ede95 in PortalRun (portal=portal@entry=0x2b8eab0, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
dest=dest@entry=0x2b2a7c8, altdest=altdest@entry=0x2b2a7c8, 
completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1596
#10 0x007e5ad9 in exec_simple_query 
(query_string=query_string@entry=0x2b29100 "COPY 
mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, 
tracking_id, name, value_string, value_timestamp, value_number, value_boolean, 
environment, account, channel, device, feat"...,
seqServerHost=seqServerHost@entry=0x0, 
seqServerPort=seqServerPort@entry=-1) at postgres.c:1816
#11 0x007e6cb2 in PostgresMain (argc=, argv=, argv@entry=0x29d7820, username=0x29d75d0 "mis_ig") at postgres.c:4840
#12 0x00799540 in BackendRun (port=0x29afc50) at postmaster.c:5915
#13 BackendStartup (port=0x29afc50) at postmaster.c:5484
#14 ServerLoop () at postmaster.c:2163
#15 0x0079c309 in PostmasterMain (argc=, argv=) at postmaster.c:1454
#16 0x004a4209 in main (argc=9, argv=0x29af010) at main.c:226
{code}

Jumping into frame 3 and running info locals, we found something odd about the 
"status" variable:

{code}
(gdb) f 3
#3  0x0053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, 
existing_segnos@entry=0x0, relid=relid@entry=1195061, 
segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', 
keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
1166  appendonlywriter.c: No such file or directory.
(gdb) info locals
status = 0x0
[...]
{code}

This panic comes from this piece of code in "appendonlywritter.c":

{code}
for (int i = 0; i < segment_num; i++)
{
    AOSegfileStatus *status = maxSegno4Segment[i];
    status->inuse = true;
    status->xid = CurrentXid;
    existing_segnos = lappend_int(existing_segnos, status->segno);
}
{code}

So we are pulling a 0x0 (null?!) entry from _maxSegno4Segment_... That's 
strange, because earlier in this function we populate this array, and we 
should not reach this section unless this 

[jira] [Updated] (HAWQ-1408) PANICs during COPY ... FROM STDIN

2017-03-24 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI updated HAWQ-1408:
--
Affects Version/s: backlog

> PANICs during COPY ... FROM STDIN
> -
>
> Key: HAWQ-1408
> URL: https://issues.apache.org/jira/browse/HAWQ-1408
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Affects Versions: backlog
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: 2.1.0.0-incubating
>
>
> We found PANIC (and respective core dumps). From the initial analysis from 
> the logs and core dump, the query causing this PANIC is a "COPY ... FROM 
> STDIN". This query does not always panic.
> This kind of queries are executed from Java/Scala code (by one of IG Spark 
> Jobs). Connection to the DB is managed by connection pool (commons-dbcp2) and 
> validated on borrow by “select 1” validation query. IG is using 
> postgresql-9.4-1206-jdbc41 as a java driver to create those connections. I 
> believe they should be using the driver from DataDirect, available in PivNet; 
> however, I haven't found hard evidence pointing the driver as a root cause.
> My initial analysis on the packcore for the master PANIC. Not sure if this 
> helps or makes sense.
> This is the backtrace of the packcore for process 466858:
> {code}
> (gdb) bt
> #0  0x7fd875f906ab in raise () from 
> /data/logs/52280/packcore-core.postgres.466858/lib64/libpthread.so.0
> #1  0x008c0b19 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, 
> processName=) at elog.c:4519
> #2  
> #3  0x0053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, 
> existing_segnos@entry=0x0, relid=relid@entry=1195061, 
> segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', 
> keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
> #4  0x0053c08f in assignPerRelSegno 
> (all_relids=all_relids@entry=0x2b96d68, segment_num=6) at 
> appendonlywriter.c:1212
> #5  0x005f79e8 in DoCopy (stmt=stmt@entry=0x2b2a3d8, 
> queryString=) at copy.c:1591
> #6  0x007ef737 in ProcessUtility 
> (parsetree=parsetree@entry=0x2b2a3d8, queryString=0x2c2f550 "COPY 
> mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, 
> tracking_id, name, value_string, value_timestamp, value_number, 
> value_boolean, environment, account, channel, device, feat"...,
> params=0x0, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 
> "") at utility.c:1076
> #7  0x007ea95e in PortalRunUtility (portal=portal@entry=0x2b8eab0, 
> utilityStmt=utilityStmt@entry=0x2b2a3d8, isTopLevel=isTopLevel@entry=1 
> '\001', dest=dest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1969
> #8  0x007ec13e in PortalRunMulti (portal=portal@entry=0x2b8eab0, 
> isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, 
> altdest=altdest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:2079
> #9  0x007ede95 in PortalRun (portal=portal@entry=0x2b8eab0, 
> count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x2b2a7c8, altdest=altdest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1596
> #10 0x007e5ad9 in exec_simple_query 
> (query_string=query_string@entry=0x2b29100 "COPY 
> mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, 
> tracking_id, name, value_string, value_timestamp, value_number, 
> value_boolean, environment, account, channel, device, feat"...,
> seqServerHost=seqServerHost@entry=0x0, 
> seqServerPort=seqServerPort@entry=-1) at postgres.c:1816
> #11 0x007e6cb2 in PostgresMain (argc=, argv= out>, argv@entry=0x29d7820, username=0x29d75d0 "mis_ig") at postgres.c:4840
> #12 0x00799540 in BackendRun (port=0x29afc50) at postmaster.c:5915
> #13 BackendStartup (port=0x29afc50) at postmaster.c:5484
> #14 ServerLoop () at postmaster.c:2163
> #15 0x0079c309 in PostmasterMain (argc=, 
> argv=) at postmaster.c:1454
> #16 0x004a4209 in main (argc=9, argv=0x29af010) at main.c:226
> {code}
> Jumping into the frame 3 and running info locals, we found something odd for 
> "status" variable:
> {code}
> (gdb) f 3
> #3  0x0053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, 
> existing_segnos@entry=0x0, relid=relid@entry=1195061, 
> segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', 
> keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
> 1166  appendonlywriter.c: No such file or directory.
> (gdb) info locals
> status = 0x0
> [...]
> {code}
> This panic comes from this piece of code in "appendonlywritter.c":
> {code}
> for (int i = 0; i < segment_num; i++)
> {
> AOSegfileStatus 

[jira] [Commented] (HAWQ-1408) PANICs during COPY ... FROM STDIN

2017-03-24 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940431#comment-15940431
 ] 

Ming LI commented on HAWQ-1408:
---

Since there are no reproduction steps, I could only go through the code to 
look for a possible root cause, and I have found one candidate.

The likely cause is related to https://issues.apache.org/jira/browse/HAWQ-642.
(1) When keepHash is set, generating only remaining_num new seg files is not 
guaranteed to be enough to allocate a seg file for every segment_num, because a 
newly generated seg file number may map to the same hash key as an existing one.
(2) When keepHash is not set, the remaining_num returned by addCandidateSegno() 
is not precise, so it needs to be fixed to satisfy HAWQ-642.
(3) Because of (1), we should keep monitoring remaining_num at the final call 
to addCandidateSegno() instead of deciding it only at the beginning. That way, 
even if a newly generated seg file turns out to be unusable by this query (for 
example because of a hash key conflict), we can continue allocating until the 
query has enough seg files (a simplified sketch of this allocation loop follows 
below).

@lilima1 @hubertzhang, please correct me if I am wrong. Thanks.
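
To make point (3) concrete, here is a deliberately simplified, self-contained 
model of the allocation idea. The names SEGMENT_NUM and seg_hash_bucket() and 
the modulo hashing are illustrative only; the real logic lives in 
appendonlywriter.c.

{code}
#include <stdbool.h>
#include <stdio.h>

#define SEGMENT_NUM   6     /* virtual segments that need a seg file */
#define MAX_SEGFILES  128   /* upper bound on candidate seg file numbers */

/* Toy stand-in for the hash-bucket mapping of a seg file number. */
static int seg_hash_bucket(int segno)
{
    return segno % SEGMENT_NUM;
}

int main(void)
{
    bool bucket_filled[SEGMENT_NUM] = { false };
    int  chosen[SEGMENT_NUM];
    int  filled = 0;

    /*
     * Keep generating candidates and re-checking how many buckets are still
     * unfilled, rather than fixing the count up front: a newly generated
     * segno can collide with a bucket that is already filled.
     */
    for (int segno = 1; segno <= MAX_SEGFILES && filled < SEGMENT_NUM; segno++)
    {
        int bucket = seg_hash_bucket(segno);

        if (bucket_filled[bucket])
            continue;               /* hash conflict: try the next candidate */

        bucket_filled[bucket] = true;
        chosen[bucket] = segno;
        filled++;
    }

    if (filled < SEGMENT_NUM)
    {
        fprintf(stderr, "could not allocate enough seg files\n");
        return 1;
    }

    for (int i = 0; i < SEGMENT_NUM; i++)
        printf("segment %d -> seg file %d\n", i, chosen[i]);
    return 0;
}
{code}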

> PANICs during COPY ... FROM STDIN
> -
>
> Key: HAWQ-1408
> URL: https://issues.apache.org/jira/browse/HAWQ-1408
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Affects Versions: backlog
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: 2.1.0.0-incubating
>
>
> We found PANIC (and respective core dumps). From the initial analysis from 
> the logs and core dump, the query causing this PANIC is a "COPY ... FROM 
> STDIN". This query does not always panic.
> This kind of queries are executed from Java/Scala code (by one of IG Spark 
> Jobs). Connection to the DB is managed by connection pool (commons-dbcp2) and 
> validated on borrow by “select 1” validation query. IG is using 
> postgresql-9.4-1206-jdbc41 as a java driver to create those connections. I 
> believe they should be using the driver from DataDirect, available in PivNet; 
> however, I haven't found hard evidence pointing the driver as a root cause.
> My initial analysis on the packcore for the master PANIC. Not sure if this 
> helps or makes sense.
> This is the backtrace of the packcore for process 466858:
> {code}
> (gdb) bt
> #0  0x7fd875f906ab in raise () from 
> /data/logs/52280/packcore-core.postgres.466858/lib64/libpthread.so.0
> #1  0x008c0b19 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, 
> processName=) at elog.c:4519
> #2  
> #3  0x0053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, 
> existing_segnos@entry=0x0, relid=relid@entry=1195061, 
> segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', 
> keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
> #4  0x0053c08f in assignPerRelSegno 
> (all_relids=all_relids@entry=0x2b96d68, segment_num=6) at 
> appendonlywriter.c:1212
> #5  0x005f79e8 in DoCopy (stmt=stmt@entry=0x2b2a3d8, 
> queryString=) at copy.c:1591
> #6  0x007ef737 in ProcessUtility 
> (parsetree=parsetree@entry=0x2b2a3d8, queryString=0x2c2f550 "COPY 
> mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, 
> tracking_id, name, value_string, value_timestamp, value_number, 
> value_boolean, environment, account, channel, device, feat"...,
> params=0x0, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 
> "") at utility.c:1076
> #7  0x007ea95e in PortalRunUtility (portal=portal@entry=0x2b8eab0, 
> utilityStmt=utilityStmt@entry=0x2b2a3d8, isTopLevel=isTopLevel@entry=1 
> '\001', dest=dest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1969
> #8  0x007ec13e in PortalRunMulti (portal=portal@entry=0x2b8eab0, 
> isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, 
> altdest=altdest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:2079
> #9  0x007ede95 in PortalRun (portal=portal@entry=0x2b8eab0, 
> count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x2b2a7c8, altdest=altdest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1596
> #10 0x007e5ad9 in exec_simple_query 
> (query_string=query_string@entry=0x2b29100 "COPY 
> mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, 
> tracking_id, name, value_string, value_timestamp, value_number, 
> value_boolean, environment, account, channel, device, feat"...,
> seqServerHost=seqServerHost@entry=0x0, 
> seqServerPort=seqServerPort@entry=-1) at postgres.c:1816
> #11 0x007e6cb2 in PostgresMain (argc=, argv= out>, argv@entry=0x29d7820, username=0x29d75d0 "mis_ig") at postgres.c:4840
> #12 0x00799540 in 

[jira] [Assigned] (HAWQ-1408) PANICs during COPY ... FROM STDIN

2017-03-24 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1408:
-

Assignee: Ming LI  (was: Ed Espino)

> PANICs during COPY ... FROM STDIN
> -
>
> Key: HAWQ-1408
> URL: https://issues.apache.org/jira/browse/HAWQ-1408
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: 2.1.0.0-incubating
>
>
> We found PANIC (and respective core dumps). From the initial analysis from 
> the logs and core dump, the query causing this PANIC is a "COPY ... FROM 
> STDIN". This query does not always panic.
> This kind of queries are executed from Java/Scala code (by one of IG Spark 
> Jobs). Connection to the DB is managed by connection pool (commons-dbcp2) and 
> validated on borrow by “select 1” validation query. IG is using 
> postgresql-9.4-1206-jdbc41 as a java driver to create those connections. I 
> believe they should be using the driver from DataDirect, available in PivNet; 
> however, I haven't found hard evidence pointing the driver as a root cause.
> My initial analysis on the packcore for the master PANIC. Not sure if this 
> helps or makes sense.
> This is the backtrace of the packcore for process 466858:
> {code}
> (gdb) bt
> #0  0x7fd875f906ab in raise () from 
> /data/logs/52280/packcore-core.postgres.466858/lib64/libpthread.so.0
> #1  0x008c0b19 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, 
> processName=) at elog.c:4519
> #2  
> #3  0x0053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, 
> existing_segnos@entry=0x0, relid=relid@entry=1195061, 
> segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', 
> keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
> #4  0x0053c08f in assignPerRelSegno 
> (all_relids=all_relids@entry=0x2b96d68, segment_num=6) at 
> appendonlywriter.c:1212
> #5  0x005f79e8 in DoCopy (stmt=stmt@entry=0x2b2a3d8, 
> queryString=) at copy.c:1591
> #6  0x007ef737 in ProcessUtility 
> (parsetree=parsetree@entry=0x2b2a3d8, queryString=0x2c2f550 "COPY 
> mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, 
> tracking_id, name, value_string, value_timestamp, value_number, 
> value_boolean, environment, account, channel, device, feat"...,
> params=0x0, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 
> "") at utility.c:1076
> #7  0x007ea95e in PortalRunUtility (portal=portal@entry=0x2b8eab0, 
> utilityStmt=utilityStmt@entry=0x2b2a3d8, isTopLevel=isTopLevel@entry=1 
> '\001', dest=dest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1969
> #8  0x007ec13e in PortalRunMulti (portal=portal@entry=0x2b8eab0, 
> isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, 
> altdest=altdest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:2079
> #9  0x007ede95 in PortalRun (portal=portal@entry=0x2b8eab0, 
> count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
> dest=dest@entry=0x2b2a7c8, altdest=altdest@entry=0x2b2a7c8, 
> completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1596
> #10 0x007e5ad9 in exec_simple_query 
> (query_string=query_string@entry=0x2b29100 "COPY 
> mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, 
> tracking_id, name, value_string, value_timestamp, value_number, 
> value_boolean, environment, account, channel, device, feat"...,
> seqServerHost=seqServerHost@entry=0x0, 
> seqServerPort=seqServerPort@entry=-1) at postgres.c:1816
> #11 0x007e6cb2 in PostgresMain (argc=, argv= out>, argv@entry=0x29d7820, username=0x29d75d0 "mis_ig") at postgres.c:4840
> #12 0x00799540 in BackendRun (port=0x29afc50) at postmaster.c:5915
> #13 BackendStartup (port=0x29afc50) at postmaster.c:5484
> #14 ServerLoop () at postmaster.c:2163
> #15 0x0079c309 in PostmasterMain (argc=, 
> argv=) at postmaster.c:1454
> #16 0x004a4209 in main (argc=9, argv=0x29af010) at main.c:226
> {code}
> Jumping into the frame 3 and running info locals, we found something odd for 
> "status" variable:
> {code}
> (gdb) f 3
> #3  0x0053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, 
> existing_segnos@entry=0x0, relid=relid@entry=1195061, 
> segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', 
> keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
> 1166  appendonlywriter.c: No such file or directory.
> (gdb) info locals
> status = 0x0
> [...]
> {code}
> This panic comes from this piece of code in "appendonlywritter.c":
> {code}
> for (int i = 0; i < segment_num; i++)
> {
> AOSegfileStatus *status = 

[jira] [Commented] (HAWQ-1389) TestRowTypes.BasicTest fails wrong with error "tuple concurrently updated"

2017-03-16 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927845#comment-15927845
 ] 

Ming LI commented on HAWQ-1389:
---

I initially thought the root cause was a conflict between this test's relation 
name and one used by another test case.
However, the feature test framework automatically qualifies objects with a 
schema name of the form "class_func", so the cause must be something else; more 
time is needed to investigate.

> TestRowTypes.BasicTest fails wrong with  error "tuple concurrently updated"
> ---
>
> Key: HAWQ-1389
> URL: https://issues.apache.org/jira/browse/HAWQ-1389
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Tests
>Reporter: Ming LI
>Assignee: Ming LI
>
> https://hawq.ci.pivotalci.info/teams/main/pipelines/hdb/jobs/fullfeaturetest_opt_centos6/builds/31
> [2/184] TestRowTypes.BasicTest (17967 ms)
> Note: Google Test filter = TestRowTypes.BasicTest
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from TestRowTypes
> [ RUN ] TestRowTypes.BasicTest
> COPY tenk1 FROM '/tmp/build/d29698ca/featuretest/query/data/tenk.data'
> lib/sql_util.cpp:93: Failure
> Expected: 0
> To be equal to: (conn->runSQLCommand(sql)).getLastStatus()
> Which is: 1
> NOTICE: drop cascades to type testrowtypes_basictest.quad
> NOTICE: drop cascades to append only table pg_temp_32.quadtable column q
> ERROR: tuple concurrently updated (heapam.c:2689)
> [ FAILED ] TestRowTypes.BasicTest (17958 ms)
> [--] 1 test from TestRowTypes (17958 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (17958 ms total)
> [ PASSED ] 0 tests.
> [ FAILED ] 1 test, listed below:
> [ FAILED ] TestRowTypes.BasicTest



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (HAWQ-1389) TestRowTypes.BasicTest fails wrong with error "tuple concurrently updated"

2017-03-16 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1389:
-

Assignee: Ming LI  (was: Jiali Yao)

> TestRowTypes.BasicTest fails wrong with  error "tuple concurrently updated"
> ---
>
> Key: HAWQ-1389
> URL: https://issues.apache.org/jira/browse/HAWQ-1389
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Tests
>Reporter: Ming LI
>Assignee: Ming LI
>
> https://hawq.ci.pivotalci.info/teams/main/pipelines/hdb/jobs/fullfeaturetest_opt_centos6/builds/31
> [2/184] TestRowTypes.BasicTest (17967 ms)
> Note: Google Test filter = TestRowTypes.BasicTest
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from TestRowTypes
> [ RUN ] TestRowTypes.BasicTest
> COPY tenk1 FROM '/tmp/build/d29698ca/featuretest/query/data/tenk.data'
> lib/sql_util.cpp:93: Failure
> Expected: 0
> To be equal to: (conn->runSQLCommand(sql)).getLastStatus()
> Which is: 1
> NOTICE: drop cascades to type testrowtypes_basictest.quad
> NOTICE: drop cascades to append only table pg_temp_32.quadtable column q
> ERROR: tuple concurrently updated (heapam.c:2689)
> [ FAILED ] TestRowTypes.BasicTest (17958 ms)
> [--] 1 test from TestRowTypes (17958 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (17958 ms total)
> [ PASSED ] 0 tests.
> [ FAILED ] 1 test, listed below:
> [ FAILED ] TestRowTypes.BasicTest



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HAWQ-1389) TestRowTypes.BasicTest fails wrong with error "tuple concurrently updated"

2017-03-16 Thread Ming LI (JIRA)
Ming LI created HAWQ-1389:
-

 Summary: TestRowTypes.BasicTest fails wrong with  error "tuple 
concurrently updated"
 Key: HAWQ-1389
 URL: https://issues.apache.org/jira/browse/HAWQ-1389
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Tests
Reporter: Ming LI
Assignee: Jiali Yao


https://hawq.ci.pivotalci.info/teams/main/pipelines/hdb/jobs/fullfeaturetest_opt_centos6/builds/31
[2/184] TestRowTypes.BasicTest (17967 ms)
Note: Google Test filter = TestRowTypes.BasicTest
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from TestRowTypes
[ RUN ] TestRowTypes.BasicTest
COPY tenk1 FROM '/tmp/build/d29698ca/featuretest/query/data/tenk.data'
lib/sql_util.cpp:93: Failure
Expected: 0
To be equal to: (conn->runSQLCommand(sql)).getLastStatus()
Which is: 1
NOTICE: drop cascades to type testrowtypes_basictest.quad
NOTICE: drop cascades to append only table pg_temp_32.quadtable column q
ERROR: tuple concurrently updated (heapam.c:2689)
[ FAILED ] TestRowTypes.BasicTest (17958 ms)
[--] 1 test from TestRowTypes (17958 ms total)
[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (17958 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] TestRowTypes.BasicTest



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (HAWQ-1342) QE process hang in shared input scan on segment node

2017-02-23 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1342.
---
Resolution: Fixed

> QE process hang in shared input scan on segment node
> 
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Query Execution
>Affects Versions: 2.0.0.0-incubating
>Reporter: Amy
>Assignee: Ming LI
> Fix For: backlog
>
>
> QE process hang on some segment node while QD and QE on other segment nodes 
> terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1   master   secondary namenode
> test2   segment   datanode
> test3   segment   datanode
> test4   segment   datanode
> test5   segment   namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin   41877  1  0 05:35 ?00:01:04 
> /usr/local/hawq_2_1_0_0/bin/postgres -D 
> /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd
>  -i -M segment -p 20100 --silent-mode=true
> gpadmin   41878  41877  0 05:35 ?00:00:02 postgres: port 20100, 
> logger process
> gpadmin   41881  41877  0 05:35 ?00:00:00 postgres: port 20100, stats 
> collector process
> gpadmin   41882  41877  0 05:35 ?00:00:07 postgres: port 20100, 
> writer process
> gpadmin   41883  41877  0 05:35 ?00:00:01 postgres: port 20100, 
> checkpoint process
> gpadmin   41884  41877  0 05:35 ?00:00:11 postgres: port 20100, 
> segment resource manager
> gpadmin   42108  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   42416  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   44807  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC 
> SELECT
> gpadmin   44819  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   44821  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC 
> SELECT
> gpadmin   45447  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   49859  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC 
> SELECT
> gpadmin   49881  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51937  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51939  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   51941  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   51943  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC 
> SELECT
> gpadmin   51953  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC 
> SELECT
> gpadmin   53436  41877  0 05:40 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC 
> SELECT
> gpadmin   57095  41877  0 05:41 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   57097  41877  0 05:41 ?00:00:04 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   63159  41877  0 05:43 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11474) con1082 seg0 cmd2 slice15 
> MPPEXEC SELECT
> gpadmin   64018  41877  0 05:44 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11905) con1121 seg0 cmd2 slice5 MPPEXEC 
> SELECT
> {code}
> The stack info is as below and it seems that QE hang in shared input scan.
> {code}
> [gpadmin@test3 ~]$ gdb -p 42108
> (gdb) info threads
>   2 Thread 0x7f4f6b335700 (LWP 42109)  0x0032214df283 in poll () from 
> /lib64/libc.so.6
> * 1 Thread 0x7f4f9041c920 (LWP 42108)  0x0032214e1523 in select () from 
> 

[jira] [Commented] (HAWQ-1342) QE process hang in shared input scan on segment node

2017-02-22 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880029#comment-15880029
 ] 

Ming LI commented on HAWQ-1342:
---

Some may have questions about how select() error returns are handled; below is 
a summary.

The behavior for select() errors is now:
- On both systems:
EBADF  -- break out of the loop
EINTR  -- loop again
EINVAL -- programming error, should not occur

- On Linux:
ENOMEM -- loop again, waiting for the runaway detector to pick a transaction to 
roll back, or for the OS to pick a process to kill

- On macOS:
EAGAIN -- loop again

Conclusion: 
---
We only need to handle EBADF explicitly; the other errors either make the loop 
retry or cannot occur (a minimal sketch of the loop follows below). Thanks.
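
As a standalone illustration of that retry policy (this is not the HAWQ shared 
input scan code; the helper name and the one-second timeout are arbitrary):

{code}
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Wait until fd becomes readable.  Only EBADF ends the loop (the fd was
 * closed elsewhere); EINTR, ENOMEM and EAGAIN simply retry, matching the
 * summary above.
 */
static bool wait_readable(int fd)
{
    for (;;)
    {
        fd_set          readfds;
        struct timeval  tv = { 1, 0 };   /* re-check once per second */
        int             n;

        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);

        n = select(fd + 1, &readfds, NULL, NULL, &tv);
        if (n > 0)
            return true;                 /* fd is readable */
        if (n == 0)
            continue;                    /* timeout: loop again */

        if (errno == EBADF)
            return false;                /* fd already closed: stop waiting */
        if (errno == EINTR || errno == ENOMEM || errno == EAGAIN)
            continue;                    /* transient: loop again */

        perror("select");                /* e.g. EINVAL: programming error */
        return false;
    }
}

int main(void)
{
    /* stdin is used here purely for demonstration */
    printf("readable: %d\n", wait_readable(0));
    return 0;
}
{code}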


> QE process hang in shared input scan on segment node
> 
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Query Execution
>Affects Versions: 2.0.0.0-incubating
>Reporter: Amy
>Assignee: Ming LI
> Fix For: backlog
>
>
> QE process hang on some segment node while QD and QE on other segment nodes 
> terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1   master   secondary namenode
> test2   segment   datanode
> test3   segment   datanode
> test4   segment   datanode
> test5   segment   namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin   41877  1  0 05:35 ?00:01:04 
> /usr/local/hawq_2_1_0_0/bin/postgres -D 
> /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd
>  -i -M segment -p 20100 --silent-mode=true
> gpadmin   41878  41877  0 05:35 ?00:00:02 postgres: port 20100, 
> logger process
> gpadmin   41881  41877  0 05:35 ?00:00:00 postgres: port 20100, stats 
> collector process
> gpadmin   41882  41877  0 05:35 ?00:00:07 postgres: port 20100, 
> writer process
> gpadmin   41883  41877  0 05:35 ?00:00:01 postgres: port 20100, 
> checkpoint process
> gpadmin   41884  41877  0 05:35 ?00:00:11 postgres: port 20100, 
> segment resource manager
> gpadmin   42108  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   42416  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   44807  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC 
> SELECT
> gpadmin   44819  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   44821  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC 
> SELECT
> gpadmin   45447  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   49859  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC 
> SELECT
> gpadmin   49881  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51937  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51939  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   51941  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   51943  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC 
> SELECT
> gpadmin   51953  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC 
> SELECT
> gpadmin   53436  41877  0 05:40 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC 
> SELECT
> gpadmin   57095  41877  0 05:41 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   57097  41877  0 05:41 ?00:00:04 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   63159  41877  0 05:43 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11474) 

[jira] [Commented] (HAWQ-1342) QE process hang in shared input scan on segment node

2017-02-22 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879968#comment-15879968
 ] 

Ming LI commented on HAWQ-1342:
---

The basic idea for this kind of hang is:
(1) The segment that throws the error rolls back the whole transaction, and all 
related fds are closed when the transaction ends.
(2) The other segments act as before: while waiting in select(), they loop until 
the specific fd is closed, and then execution continues until the process is 
interrupted again elsewhere (the rolled-back transaction sends a cancel signal).

Accordingly, some previous fixes (HAWQ-166, HAWQ-1282) will be changed; see the 
sketch after this list.
(1) HAWQ-166: we no longer need to skip sending info.
(2) HAWQ-1282:
  - we no longer need to close the fd explicitly; it is closed automatically at 
transaction end.
  - we simply end the loop once we find that the related fd has already been closed.
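
A minimal sketch of the resulting wait loop, assuming helpers named 
fd_already_closed() and check_for_interrupts() (placeholder names, not the actual 
shared input scan code):
{code}
/* Illustrative only: wait for the shared-scan notification, and stop looping
 * once the fd has been closed by the aborting transaction. */
#include <errno.h>
#include <stdbool.h>
#include <sys/select.h>

static bool fd_already_closed(int fd);     /* assumed helper */
static void check_for_interrupts(void);    /* assumed helper */

static void
wait_for_shared_scan_done(int fd)
{
    fd_set rset;

    for (;;)
    {
        FD_ZERO(&rset);
        FD_SET(fd, &rset);

        if (select(fd + 1, &rset, NULL, NULL, NULL) >= 0)
            break;                         /* notification arrived */

        if (errno == EBADF || fd_already_closed(fd))
            break;                         /* fd closed at transaction abort: end the loop */

        /* EINTR and similar: let interrupts (e.g. the cancel signal sent by
         * the rolled-back transaction) fire, then retry. */
        check_for_interrupts();
    }
}
{code}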

> QE process hang in shared input scan on segment node
> 
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Query Execution
>Affects Versions: 2.0.0.0-incubating
>Reporter: Amy
>Assignee: Ming LI
> Fix For: backlog
>
>
> QE processes hang on some segment nodes while the QD and the QEs on other segment 
> nodes have terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1   master   secondary namenode
> test2   segment   datanode
> test3   segment   datanode
> test4   segment   datanode
> test5   segment   namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin   41877  1  0 05:35 ?00:01:04 
> /usr/local/hawq_2_1_0_0/bin/postgres -D 
> /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd
>  -i -M segment -p 20100 --silent-mode=true
> gpadmin   41878  41877  0 05:35 ?00:00:02 postgres: port 20100, 
> logger process
> gpadmin   41881  41877  0 05:35 ?00:00:00 postgres: port 20100, stats 
> collector process
> gpadmin   41882  41877  0 05:35 ?00:00:07 postgres: port 20100, 
> writer process
> gpadmin   41883  41877  0 05:35 ?00:00:01 postgres: port 20100, 
> checkpoint process
> gpadmin   41884  41877  0 05:35 ?00:00:11 postgres: port 20100, 
> segment resource manager
> gpadmin   42108  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   42416  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   44807  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC 
> SELECT
> gpadmin   44819  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   44821  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC 
> SELECT
> gpadmin   45447  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   49859  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC 
> SELECT
> gpadmin   49881  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51937  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51939  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   51941  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   51943  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC 
> SELECT
> gpadmin   51953  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC 
> SELECT
> gpadmin   53436  41877  0 05:40 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC 
> SELECT
> gpadmin   57095  41877  0 05:41 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   57097  41877  0 05:41 ?00:00:04 postgres: port 20100, 

[jira] [Assigned] (HAWQ-1342) QE process hang in shared input scan on segment node

2017-02-22 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1342:
-

Assignee: Ming LI  (was: Amy)

> QE process hang in shared input scan on segment node
> 
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Query Execution
>Affects Versions: 2.0.0.0-incubating
>Reporter: Amy
>Assignee: Ming LI
> Fix For: backlog
>
>
> QE processes hang on some segment nodes while the QD and the QEs on other segment 
> nodes have terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1   master   secondary namenode
> test2   segment   datanode
> test3   segment   datanode
> test4   segment   datanode
> test5   segment   namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin   41877  1  0 05:35 ?00:01:04 
> /usr/local/hawq_2_1_0_0/bin/postgres -D 
> /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd
>  -i -M segment -p 20100 --silent-mode=true
> gpadmin   41878  41877  0 05:35 ?00:00:02 postgres: port 20100, 
> logger process
> gpadmin   41881  41877  0 05:35 ?00:00:00 postgres: port 20100, stats 
> collector process
> gpadmin   41882  41877  0 05:35 ?00:00:07 postgres: port 20100, 
> writer process
> gpadmin   41883  41877  0 05:35 ?00:00:01 postgres: port 20100, 
> checkpoint process
> gpadmin   41884  41877  0 05:35 ?00:00:11 postgres: port 20100, 
> segment resource manager
> gpadmin   42108  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   42416  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   44807  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC 
> SELECT
> gpadmin   44819  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   44821  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC 
> SELECT
> gpadmin   45447  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   49859  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC 
> SELECT
> gpadmin   49881  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51937  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51939  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   51941  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   51943  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC 
> SELECT
> gpadmin   51953  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC 
> SELECT
> gpadmin   53436  41877  0 05:40 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC 
> SELECT
> gpadmin   57095  41877  0 05:41 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   57097  41877  0 05:41 ?00:00:04 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   63159  41877  0 05:43 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11474) con1082 seg0 cmd2 slice15 
> MPPEXEC SELECT
> gpadmin   64018  41877  0 05:44 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11905) con1121 seg0 cmd2 slice5 MPPEXEC 
> SELECT
> {code}
> The stack info is as below; it seems the QE hangs in a shared input scan.
> {code}
> [gpadmin@test3 ~]$ gdb -p 42108
> (gdb) info threads
>   2 Thread 0x7f4f6b335700 (LWP 42109)  0x0032214df283 in poll () from 
> /lib64/libc.so.6
> * 1 Thread 0x7f4f9041c920 (LWP 42108)  0x0032214e1523 in 

[jira] [Assigned] (HAWQ-1339) Cache lookup failed after explain OLAP grouping query

2017-02-22 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1339:
-

Assignee: Ed Espino  (was: Ming LI)

> Cache lookup failed after explain OLAP grouping query
> -
>
> Key: HAWQ-1339
> URL: https://issues.apache.org/jira/browse/HAWQ-1339
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Amy
>Assignee: Ed Espino
> Fix For: backlog
>
> Attachments: olap_setup.sql
>
>
> Some OLAP grouping queries may error out with "division by zero", and when the 
> query is explained, a notice "cache lookup failed for attribute 7 of relation 
> 75036 (lsyscache.c:437)" appears.
> {code}
> postgres=# SELECT sale.vn,sale.cn,sale.dt,GROUPING(sale.vn), 
> TO_CHAR(COALESCE(MAX(DISTINCT 
> floor(sale.vn+sale.qty)),0),'.999'),TO_CHAR(COALESCE(VAR_SAMP(floor(sale.pn/sale.prc)),0),'.999'),TO_CHAR(COALESCE(COUNT(floor(sale.qty+sale.prc)),0),'.999')
> postgres-# FROM sale,customer,vendor
> postgres-# WHERE sale.cn=customer.cn AND sale.vn=vendor.vn
> postgres-# GROUP BY 
> ROLLUP((sale.prc),(sale.vn,sale.vn),(sale.pn,sale.pn),(sale.dt),(sale.qty,sale.vn,sale.qty)),ROLLUP((sale.pn),(sale.vn,sale.pn),(sale.qty)),(),sale.cn
>  HAVING COALESCE(VAR_POP(sale.cn),0) >= 45.5839785564113;
> ERROR:  division by zero  (seg0 localhost:4 pid=25205)
> postgres=#
> postgres=# explain SELECT sale.vn,sale.cn,sale.dt,GROUPING(sale.vn), 
> TO_CHAR(COALESCE(MAX(DISTINCT 
> floor(sale.vn+sale.qty)),0),'.999'),TO_CHAR(COALESCE(VAR_SAMP(floor(sale.pn/sale.prc)),0),'.999'),TO_CHAR(COALESCE(COUNT(floor(sale.qty+sale.prc)),0),'.999')
> FROM sale,customer,vendor
> WHERE sale.cn=customer.cn AND sale.vn=vendor.vn
> GROUP BY 
> ROLLUP((sale.prc),(sale.vn,sale.vn),(sale.pn,sale.pn),(sale.dt),(sale.qty,sale.vn,sale.qty)),ROLLUP((sale.pn),(sale.vn,sale.pn),(sale.qty)),(),sale.cn
>  HAVING COALESCE(VAR_POP(sale.cn),0) >= 45.5839785564113;
> NOTICE:  cache lookup failed for attribute 7 of relation 75036 
> (lsyscache.c:437)
> {code}
> The reproduction steps are:
> {code}
> Step 1: Prepare schema and data using attached olap_setup.sql
> Step 2: Run below OLAP grouping query
> -- OLAP query involving MAX() function
> SELECT sale.vn,sale.cn,sale.dt,GROUPING(sale.vn), 
> TO_CHAR(COALESCE(MAX(DISTINCT 
> floor(sale.vn+sale.qty)),0),'.999'),TO_CHAR(COALESCE(VAR_SAMP(floor(sale.pn/sale.prc)),0),'.999'),TO_CHAR(COALESCE(COUNT(floor(sale.qty+sale.prc)),0),'.999')
> FROM sale,customer,vendor
> WHERE sale.cn=customer.cn AND sale.vn=vendor.vn
> GROUP BY 
> ROLLUP((sale.prc),(sale.vn,sale.vn),(sale.pn,sale.pn),(sale.dt),(sale.qty,sale.vn,sale.qty)),ROLLUP((sale.pn),(sale.vn,sale.pn),(sale.qty)),(),sale.cn
>  HAVING COALESCE(VAR_POP(sale.cn),0) >= 45.5839785564113;
> explain SELECT sale.vn,sale.cn,sale.dt,GROUPING(sale.vn), 
> TO_CHAR(COALESCE(MAX(DISTINCT 
> floor(sale.vn+sale.qty)),0),'.999'),TO_CHAR(COALESCE(VAR_SAMP(floor(sale.pn/sale.prc)),0),'.999'),TO_CHAR(COALESCE(COUNT(floor(sale.qty+sale.prc)),0),'.999')
> FROM sale,customer,vendor
> WHERE sale.cn=customer.cn AND sale.vn=vendor.vn
> GROUP BY 
> ROLLUP((sale.prc),(sale.vn,sale.vn),(sale.pn,sale.pn),(sale.dt),(sale.qty,sale.vn,sale.qty)),ROLLUP((sale.pn),(sale.vn,sale.pn),(sale.qty)),(),sale.cn
>  HAVING COALESCE(VAR_POP(sale.cn),0) >= 45.5839785564113;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (HAWQ-1345) Cannot connect to PSQL: FATAL: could not count blocks of relation 1663/16508/1249: Not a directory

2017-02-22 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1345.
---
Resolution: Fixed

> Cannot connect to PSQL: FATAL: could not count blocks of relation 
> 1663/16508/1249: Not a directory
> --
>
> Key: HAWQ-1345
> URL: https://issues.apache.org/jira/browse/HAWQ-1345
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: 2.0.0.0-incubating
>Reporter: Amy
>Assignee: Ming LI
> Fix For: backlog
>
>
> Unable to connect to psql for current database. 
> We can access psql for template1 database but for current database we are 
> getting the following error:
> {code}
> #psql 
> psql: FATAL: could not count blocks of relation 1663/16508/1249: Not a 
> directory
> {code}
> When trying to failover to Standby and starting HAWQ Master we get the 
> following error again:
> {code}
> 2017-02-17 02:12:50.119207 
> PST,,,p22482,th-16818971840,,,seg-1,"DEBUG1","0","opening 
> ""pg_xlog/00010005001D"" for readin
> g (log 5, seg 29)",,,0,,"xlog.c",3162,
> 2017-02-17 02:12:50.176450 
> PST,,,p22482,th-16818971840,,,seg-1,"FATAL","42809","could not 
> count blocks of relation 1663/16508/1249: Not
> a directory","xlog redo insert: rel 1663/16508/1249; tid 32682/85
> REDO PASS 3 @ 5/7669B838; LSN 5/7669E480: prev 5/76694C98; xid 825193; bkpb1: 
> Heap - insert: rel 1663/16508/1249; tid 32682/85",,0,,"smgr.c",1146,"
> Stack trace:
> 10x8c5628 postgres errstart + 0x288
> 20x7ddfbc postgres smgrnblocks + 0x3c
> 30x4fbdf8 postgres XLogReadBuffer + 0x18
> 40x4ea2c9 postgres  + 0x4ea2c9
> 50x4eaf47 postgres  + 0x4eaf47
> 60x4f8af3 postgres StartupXLOG_Pass3 + 0x153
> 70x4fb277 postgres StartupProcessMain + 0x187
> 80x557cd8 postgres AuxiliaryProcessMain + 0x478
> 90x793c40 postgres  + 0x793c40
> 10   0x798901 postgres  + 0x798901
> 11   0x79a8c9 postgres PostmasterMain + 0x759
> 12   0x4a4039 postgres main + 0x519
> 13   0x7f3b979e1d5d libc.so.6 __libc_start_main + 0xfd
> 14   0x4a40b9 postgres  + 0x4a40b9
> "
> {code}
> On both Master and Standby, we can see that pg_attribute for current 
> database, file 1663/16508/1249 has reached 1GB in size:
> {code}
> [gpadmin@master]$pwd
> /data/hawq/master
> [gpadmin@master master]$ cd  base
> [gpadmin@master base]$ ls
> 1  16386  16387  16508
> [gpadmin@master base]$ cd 16508
> [gpadmin@master 16508]$ ls -thrl 1249
> -rw--- 1 gpadmin gpadmin 1.0G Feb 16 18:24 1249
> {code}
> From strace we were able to find the following:
> {code}
> [gpadmin@master master]$ strace  /usr/local/hawq/bin/postgres --single -P -O 
> -p 5432 -D $MASTER_DATA_DIRECTORY -c gp_session_role=utility currentdatabase 
> < select version();
> EOF
> (...)
> open("base/16508/pg_internal.init", O_RDONLY) = -1 ENOENT (No such file or 
> directory)
> open("base/16508/1259", O_RDWR) = 6
> lseek(6, 0, SEEK_END)   = 188645376
> lseek(6, 0, SEEK_SET)   = 0
> read(6, 
> "\0\0\0\0\340\5\327\1\1\0\1\0\f\3@\3\0\200\4\2008\263P\1`\262\252\1\270\261P\1"...,
>  32768) = 32768
> open("base/16508/1249", O_RDWR) = 8
> lseek(8, 0, SEEK_END)   = 1073741824
> open("base/16508/1249/1", O_RDWR)   = -1 ENOTDIR (Not a directory)
> open("base/16508/1249/1", O_RDWR|O_CREAT, 0600) = -1 ENOTDIR (Not a directory)
> futex(0x7ff80e53f620, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> futex(0x7ff80e756af0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> open("/usr/share/locale/locale.alias", O_RDONLY) = 10
> fstat(10, {st_mode=S_IFREG|0644, st_size=2512, ...}) = 0
> {code}
> We see HAWQ is treating pg_attribute as a directory while it is a file.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HAWQ-1342) QE process hang in shared input scan on segment node

2017-02-19 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI updated HAWQ-1342:
--
Affects Version/s: 2.0.0.0-incubating

> QE process hang in shared input scan on segment node
> 
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Query Execution
>Affects Versions: 2.0.0.0-incubating
>Reporter: Amy
>Assignee: Amy
> Fix For: backlog
>
>
> QE processes hang on some segment nodes while the QD and the QEs on other segment 
> nodes have terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1   master   secondary namenode
> test2   segment   datanode
> test3   segment   datanode
> test4   segment   datanode
> test5   segment   namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin   41877  1  0 05:35 ?00:01:04 
> /usr/local/hawq_2_1_0_0/bin/postgres -D 
> /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd
>  -i -M segment -p 20100 --silent-mode=true
> gpadmin   41878  41877  0 05:35 ?00:00:02 postgres: port 20100, 
> logger process
> gpadmin   41881  41877  0 05:35 ?00:00:00 postgres: port 20100, stats 
> collector process
> gpadmin   41882  41877  0 05:35 ?00:00:07 postgres: port 20100, 
> writer process
> gpadmin   41883  41877  0 05:35 ?00:00:01 postgres: port 20100, 
> checkpoint process
> gpadmin   41884  41877  0 05:35 ?00:00:11 postgres: port 20100, 
> segment resource manager
> gpadmin   42108  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   42416  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   44807  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC 
> SELECT
> gpadmin   44819  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   44821  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC 
> SELECT
> gpadmin   45447  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   49859  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC 
> SELECT
> gpadmin   49881  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51937  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51939  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   51941  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   51943  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC 
> SELECT
> gpadmin   51953  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC 
> SELECT
> gpadmin   53436  41877  0 05:40 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC 
> SELECT
> gpadmin   57095  41877  0 05:41 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   57097  41877  0 05:41 ?00:00:04 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   63159  41877  0 05:43 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11474) con1082 seg0 cmd2 slice15 
> MPPEXEC SELECT
> gpadmin   64018  41877  0 05:44 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11905) con1121 seg0 cmd2 slice5 MPPEXEC 
> SELECT
> {code}
> The stack info is as below; it seems the QE hangs in a shared input scan.
> {code}
> [gpadmin@test3 ~]$ gdb -p 42108
> (gdb) info threads
>   2 Thread 0x7f4f6b335700 (LWP 42109)  0x0032214df283 in poll () from 
> /lib64/libc.so.6
> * 1 Thread 0x7f4f9041c920 (LWP 42108)  0x0032214e1523 in select 

[jira] [Updated] (HAWQ-1345) Cannot connect to PSQL: FATAL: could not count blocks of relation 1663/16508/1249: Not a directory

2017-02-19 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI updated HAWQ-1345:
--
Affects Version/s: 2.0.0.0-incubating

> Cannot connect to PSQL: FATAL: could not count blocks of relation 
> 1663/16508/1249: Not a directory
> --
>
> Key: HAWQ-1345
> URL: https://issues.apache.org/jira/browse/HAWQ-1345
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: 2.0.0.0-incubating
>Reporter: Amy
>Assignee: Ming LI
> Fix For: backlog
>
>
> Unable to connect to psql for current database. 
> We can access psql for template1 database but for current database we are 
> getting the following error:
> {code}
> #psql 
> psql: FATAL: could not count blocks of relation 1663/16508/1249: Not a 
> directory
> {code}
> When trying to failover to Standby and starting HAWQ Master we get the 
> following error again:
> {code}
> 2017-02-17 02:12:50.119207 
> PST,,,p22482,th-16818971840,,,seg-1,"DEBUG1","0","opening 
> ""pg_xlog/00010005001D"" for readin
> g (log 5, seg 29)",,,0,,"xlog.c",3162,
> 2017-02-17 02:12:50.176450 
> PST,,,p22482,th-16818971840,,,seg-1,"FATAL","42809","could not 
> count blocks of relation 1663/16508/1249: Not
> a directory","xlog redo insert: rel 1663/16508/1249; tid 32682/85
> REDO PASS 3 @ 5/7669B838; LSN 5/7669E480: prev 5/76694C98; xid 825193; bkpb1: 
> Heap - insert: rel 1663/16508/1249; tid 32682/85",,0,,"smgr.c",1146,"
> Stack trace:
> 10x8c5628 postgres errstart + 0x288
> 20x7ddfbc postgres smgrnblocks + 0x3c
> 30x4fbdf8 postgres XLogReadBuffer + 0x18
> 40x4ea2c9 postgres  + 0x4ea2c9
> 50x4eaf47 postgres  + 0x4eaf47
> 60x4f8af3 postgres StartupXLOG_Pass3 + 0x153
> 70x4fb277 postgres StartupProcessMain + 0x187
> 80x557cd8 postgres AuxiliaryProcessMain + 0x478
> 90x793c40 postgres  + 0x793c40
> 10   0x798901 postgres  + 0x798901
> 11   0x79a8c9 postgres PostmasterMain + 0x759
> 12   0x4a4039 postgres main + 0x519
> 13   0x7f3b979e1d5d libc.so.6 __libc_start_main + 0xfd
> 14   0x4a40b9 postgres  + 0x4a40b9
> "
> {code}
> On both Master and Standby, we can see that pg_attribute for current 
> database, file 1663/16508/1249 has reached 1GB in size:
> {code}
> [gpadmin@master]$pwd
> /data/hawq/master
> [gpadmin@master master]$ cd  base
> [gpadmin@master base]$ ls
> 1  16386  16387  16508
> [gpadmin@master base]$ cd 16508
> [gpadmin@master 16508]$ ls -thrl 1249
> -rw--- 1 gpadmin gpadmin 1.0G Feb 16 18:24 1249
> {code}
> From strace we were able to find the following:
> {code}
> [gpadmin@master master]$ strace  /usr/local/hawq/bin/postgres --single -P -O 
> -p 5432 -D $MASTER_DATA_DIRECTORY -c gp_session_role=utility currentdatabase 
> < select version();
> EOF
> (...)
> open("base/16508/pg_internal.init", O_RDONLY) = -1 ENOENT (No such file or 
> directory)
> open("base/16508/1259", O_RDWR) = 6
> lseek(6, 0, SEEK_END)   = 188645376
> lseek(6, 0, SEEK_SET)   = 0
> read(6, 
> "\0\0\0\0\340\5\327\1\1\0\1\0\f\3@\3\0\200\4\2008\263P\1`\262\252\1\270\261P\1"...,
>  32768) = 32768
> open("base/16508/1249", O_RDWR) = 8
> lseek(8, 0, SEEK_END)   = 1073741824
> open("base/16508/1249/1", O_RDWR)   = -1 ENOTDIR (Not a directory)
> open("base/16508/1249/1", O_RDWR|O_CREAT, 0600) = -1 ENOTDIR (Not a directory)
> futex(0x7ff80e53f620, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> futex(0x7ff80e756af0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> open("/usr/share/locale/locale.alias", O_RDONLY) = 10
> fstat(10, {st_mode=S_IFREG|0644, st_size=2512, ...}) = 0
> {code}
> We see HAWQ is treating pg_attribute as a directory while it is a file.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (HAWQ-1345) Cannot connect to PSQL: FATAL: could not count blocks of relation 1663/16508/1249: Not a directory

2017-02-19 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1345:
-

Assignee: Ming LI  (was: Amy)

> Cannot connect to PSQL: FATAL: could not count blocks of relation 
> 1663/16508/1249: Not a directory
> --
>
> Key: HAWQ-1345
> URL: https://issues.apache.org/jira/browse/HAWQ-1345
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Amy
>Assignee: Ming LI
> Fix For: 3.0.0.0
>
>
> Unable to connect to psql for current database. 
> We can access psql for template1 database but for current database we are 
> getting the following error:
> {code}
> #psql 
> psql: FATAL: could not count blocks of relation 1663/16508/1249: Not a 
> directory
> {code}
> When trying to failover to Standby and starting HAWQ Master we get the 
> following error again:
> {code}
> 2017-02-17 02:12:50.119207 
> PST,,,p22482,th-16818971840,,,seg-1,"DEBUG1","0","opening 
> ""pg_xlog/00010005001D"" for readin
> g (log 5, seg 29)",,,0,,"xlog.c",3162,
> 2017-02-17 02:12:50.176450 
> PST,,,p22482,th-16818971840,,,seg-1,"FATAL","42809","could not 
> count blocks of relation 1663/16508/1249: Not
> a directory","xlog redo insert: rel 1663/16508/1249; tid 32682/85
> REDO PASS 3 @ 5/7669B838; LSN 5/7669E480: prev 5/76694C98; xid 825193; bkpb1: 
> Heap - insert: rel 1663/16508/1249; tid 32682/85",,0,,"smgr.c",1146,"
> Stack trace:
> 10x8c5628 postgres errstart + 0x288
> 20x7ddfbc postgres smgrnblocks + 0x3c
> 30x4fbdf8 postgres XLogReadBuffer + 0x18
> 40x4ea2c9 postgres  + 0x4ea2c9
> 50x4eaf47 postgres  + 0x4eaf47
> 60x4f8af3 postgres StartupXLOG_Pass3 + 0x153
> 70x4fb277 postgres StartupProcessMain + 0x187
> 80x557cd8 postgres AuxiliaryProcessMain + 0x478
> 90x793c40 postgres  + 0x793c40
> 10   0x798901 postgres  + 0x798901
> 11   0x79a8c9 postgres PostmasterMain + 0x759
> 12   0x4a4039 postgres main + 0x519
> 13   0x7f3b979e1d5d libc.so.6 __libc_start_main + 0xfd
> 14   0x4a40b9 postgres  + 0x4a40b9
> "
> {code}
> On both Master and Standby, we can see that pg_attribute for current 
> database, file 1663/16508/1249 has reached 1GB in size:
> {code}
> [gpadmin@master]$pwd
> /data/hawq/master
> [gpadmin@master master]$ cd  base
> [gpadmin@master base]$ ls
> 1  16386  16387  16508
> [gpadmin@master base]$ cd 16508
> [gpadmin@master 16508]$ ls -thrl 1249
> -rw--- 1 gpadmin gpadmin 1.0G Feb 16 18:24 1249
> {code}
> From strace we were able to find the following:
> {code}
> [gpadmin@master master]$ strace  /usr/local/hawq/bin/postgres --single -P -O 
> -p 5432 -D $MASTER_DATA_DIRECTORY -c gp_session_role=utility currentdatabase 
> < select version();
> EOF
> (...)
> open("base/16508/pg_internal.init", O_RDONLY) = -1 ENOENT (No such file or 
> directory)
> open("base/16508/1259", O_RDWR) = 6
> lseek(6, 0, SEEK_END)   = 188645376
> lseek(6, 0, SEEK_SET)   = 0
> read(6, 
> "\0\0\0\0\340\5\327\1\1\0\1\0\f\3@\3\0\200\4\2008\263P\1`\262\252\1\270\261P\1"...,
>  32768) = 32768
> open("base/16508/1249", O_RDWR) = 8
> lseek(8, 0, SEEK_END)   = 1073741824
> open("base/16508/1249/1", O_RDWR)   = -1 ENOTDIR (Not a directory)
> open("base/16508/1249/1", O_RDWR|O_CREAT, 0600) = -1 ENOTDIR (Not a directory)
> futex(0x7ff80e53f620, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> futex(0x7ff80e756af0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> open("/usr/share/locale/locale.alias", O_RDONLY) = 10
> fstat(10, {st_mode=S_IFREG|0644, st_size=2512, ...}) = 0
> {code}
> We see HAWQ is treating pg_attribute as a directory while it is a file.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (HAWQ-1338) In some case writer process crashed when running 'hawq stop cluster'

2017-02-15 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1338.
---
   Resolution: Fixed
Fix Version/s: backlog

> In some case writer process crashed when running 'hawq stop cluster'
> 
>
> Key: HAWQ-1338
> URL: https://issues.apache.org/jira/browse/HAWQ-1338
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> On the master node of the test machine, some processes do not exit cleanly and 
> dump core after a while.
> {code}
> --- The running log  -
> 2/12/17 11:33:59 PM PST: 
> --
> 2/12/17 11:33:59 PM PST: Check if postgres/java processes are closed properly:
> 2/12/17 11:33:59 PM PST: 
> --
> 2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test1: 
> 2/12/17 11:33:59 PM PST: gpadmin5279  1  0 22:53 ?00:00:03 
> postgres: port 31000, master logger process   
>   
>   
> 2/12/17 11:33:59 PM PST: gpadmin5283  1  0 22:53 ?00:00:01 
> postgres: port 31000, writer process  
>   
>   
> 2/12/17 11:33:59 PM PST: root  23864 24  1 23:37 ?00:00:01 
> /usr/libexec/abrt-hook-ccpp 6 18446744073709551615 5283 501 501 1486971433 
> postgres
> 2/12/17 11:33:59 PM PST: -
> 2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test2: 
> 2/12/17 11:33:59 PM PST: -
> 2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test3: 
> 2/12/17 11:33:59 PM PST: -
> 2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test4: 
> 2/12/17 11:33:59 PM PST: -
> 2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test5: 
> 2/12/17 11:33:59 PM PST: -
> 2/12/17 11:33:59 PM PST: ERROR: Postgres process not closed on test1, please 
> check.
> 2/12/17 11:33:59 PM PST: 
> --
> --- The call stack -
> (gdb) bt
> #0  0x0032214325e5 in raise () from /lib64/libc.so.6
> #1  0x003221433dc5 in abort () from /lib64/libc.so.6
> #2  0x0096433a in errfinish (dummy=0) at elog.c:686
> #3  0x009665bd in elog_finish (elevel=22, fmt=0xc53af0 "process is 
> dying from critical section") at elog.c:1463
> #4  0x0086c11d in proc_exit_prepare (code=1) at ipc.c:153
> #5  0x0086c0a9 in proc_exit (code=1) at ipc.c:93
> #6  0x00964300 in errfinish (dummy=0) at elog.c:670
> #7  0x00825121 in ServiceClientRead (serviceClient=0xfc73f0, 
> response=0x7fffb96842de, responseLen=1,
> timeout=0x7fffb96842c0) at service.c:523
> #8  0x00824f7b in ServiceClientReceiveResponse 
> (serviceClient=0xfc73f0, response=0x7fffb96842de, responseLen=1,
> timeout=0x7fffb96842c0) at service.c:480
> #9  0x0082bce1 in WalSendServerClientReceiveResponse 
> (walSendResponse=0x7fffb96842de, timeout=0x7fffb96842c0)
> at walsendserver.c:372
> #10 0x0051596d in XLogQDMirrorWaitForResponse (waitForever=0 '\000') 
> at xlog.c:1919
> #11 0x00515c0c in XLogQDMirrorWrite (startidx=0, npages=1, 
> timeLineID=1, logId=0, logSeg=1, logOff=13729792)
> at xlog.c:2005
> #12 0x00516615 in XLogWrite (WriteRqst=..., flexible=0 '\000', 
> xlog_switch=0 '\000') at xlog.c:2354
> #13 0x00516d68 in XLogFlush (record=...) at xlog.c:2572
> #14 0x00522f88 in CreateCheckPoint (shutdown=1 '\001', force=1 
> '\001') at xlog.c:8136
> #15 0x0052277b in ShutdownXLOG (code=0, arg=0) at xlog.c:7865
> #16 0x00821f42 in BackgroundWriterMain () at bgwriter.c:318
> #17 0x0059c9f1 in AuxiliaryProcessMain (argc=2, argv=0x7fffb9684b60) 
> at bootstrap.c:467
> #18 0x0083c7b0 in StartChildProcess (type=BgWriterProcess) at 
> postmaster.c:6836
> #19 0x00838f39 in CommenceNormalOperations () at postmaster.c:3618
> #20 0x0083984a in do_reaper () at postmaster.c:4021
> #21 0x00835e97 in ServerLoop () at postmaster.c:2136
> #22 0x0083500f in PostmasterMain (argc=9, argv=0x288bd10) at 
> postmaster.c:1454
> #23 0x007612af in main (argc=9, argv=0x288bd10) at main.c:226
> {code}



--
This 

[jira] [Commented] (HAWQ-1338) In some case writer process crashed when running 'hawq stop cluster'

2017-02-15 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869175#comment-15869175
 ] 

Ming LI commented on HAWQ-1338:
---

(1) From the backtrace in the core file, it seems that SuppressPanic is true yet we 
still report a panic even in the process-exit path. We should suppress the panic 
there if doing so has no side effects; see the sketch below. 
(2) Now the hawq stop utility stops the standby only after the master has stopped 
successfully or has failed to stop within the 10-minute timeout, so that part seems 
to need no change.
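
A minimal sketch of the idea in (1), assuming a flag named SuppressPanic as 
mentioned above (the real variable and call sites in HAWQ may differ); it uses the 
backend's elog and CritSectionCount:
{code}
/* Illustrative only: avoid escalating to PANIC in the process-exit path when
 * panics are being suppressed; otherwise keep the existing behavior. */
static void
report_dying_in_critical_section(void)
{
    if (CritSectionCount > 0)       /* still inside a critical section at exit */
    {
        if (!SuppressPanic)
            elog(PANIC, "process is dying from critical section");
        else
            elog(LOG, "process is dying from critical section (panic suppressed)");
    }
}
{code}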

> In some case writer process crashed when running 'hawq stop cluster'
> 
>
> Key: HAWQ-1338
> URL: https://issues.apache.org/jira/browse/HAWQ-1338
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
>
> On the master node of the test machine, some processes do not exit cleanly and 
> dump core after a while.
> {code}
> --- The running log  -
> 2/12/17 11:33:59 PM PST: 
> --
> 2/12/17 11:33:59 PM PST: Check if postgres/java processes are closed properly:
> 2/12/17 11:33:59 PM PST: 
> --
> 2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test1: 
> 2/12/17 11:33:59 PM PST: gpadmin5279  1  0 22:53 ?00:00:03 
> postgres: port 31000, master logger process   
>   
>   
> 2/12/17 11:33:59 PM PST: gpadmin5283  1  0 22:53 ?00:00:01 
> postgres: port 31000, writer process  
>   
>   
> 2/12/17 11:33:59 PM PST: root  23864 24  1 23:37 ?00:00:01 
> /usr/libexec/abrt-hook-ccpp 6 18446744073709551615 5283 501 501 1486971433 
> postgres
> 2/12/17 11:33:59 PM PST: -
> 2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test2: 
> 2/12/17 11:33:59 PM PST: -
> 2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test3: 
> 2/12/17 11:33:59 PM PST: -
> 2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test4: 
> 2/12/17 11:33:59 PM PST: -
> 2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test5: 
> 2/12/17 11:33:59 PM PST: -
> 2/12/17 11:33:59 PM PST: ERROR: Postgres process not closed on test1, please 
> check.
> 2/12/17 11:33:59 PM PST: 
> --
> --- The call stack -
> (gdb) bt
> #0  0x0032214325e5 in raise () from /lib64/libc.so.6
> #1  0x003221433dc5 in abort () from /lib64/libc.so.6
> #2  0x0096433a in errfinish (dummy=0) at elog.c:686
> #3  0x009665bd in elog_finish (elevel=22, fmt=0xc53af0 "process is 
> dying from critical section") at elog.c:1463
> #4  0x0086c11d in proc_exit_prepare (code=1) at ipc.c:153
> #5  0x0086c0a9 in proc_exit (code=1) at ipc.c:93
> #6  0x00964300 in errfinish (dummy=0) at elog.c:670
> #7  0x00825121 in ServiceClientRead (serviceClient=0xfc73f0, 
> response=0x7fffb96842de, responseLen=1,
> timeout=0x7fffb96842c0) at service.c:523
> #8  0x00824f7b in ServiceClientReceiveResponse 
> (serviceClient=0xfc73f0, response=0x7fffb96842de, responseLen=1,
> timeout=0x7fffb96842c0) at service.c:480
> #9  0x0082bce1 in WalSendServerClientReceiveResponse 
> (walSendResponse=0x7fffb96842de, timeout=0x7fffb96842c0)
> at walsendserver.c:372
> #10 0x0051596d in XLogQDMirrorWaitForResponse (waitForever=0 '\000') 
> at xlog.c:1919
> #11 0x00515c0c in XLogQDMirrorWrite (startidx=0, npages=1, 
> timeLineID=1, logId=0, logSeg=1, logOff=13729792)
> at xlog.c:2005
> #12 0x00516615 in XLogWrite (WriteRqst=..., flexible=0 '\000', 
> xlog_switch=0 '\000') at xlog.c:2354
> #13 0x00516d68 in XLogFlush (record=...) at xlog.c:2572
> #14 0x00522f88 in CreateCheckPoint (shutdown=1 '\001', force=1 
> '\001') at xlog.c:8136
> #15 0x0052277b in ShutdownXLOG (code=0, arg=0) at xlog.c:7865
> #16 0x00821f42 in BackgroundWriterMain () at bgwriter.c:318
> #17 0x0059c9f1 in AuxiliaryProcessMain (argc=2, argv=0x7fffb9684b60) 
> at bootstrap.c:467
> #18 0x0083c7b0 in StartChildProcess (type=BgWriterProcess) at 
> postmaster.c:6836
> #19 0x00838f39 in CommenceNormalOperations () at postmaster.c:3618
> #20 0x0083984a in 

[jira] [Created] (HAWQ-1338) In some case writer process crashed when running 'hawq stop cluster'

2017-02-15 Thread Ming LI (JIRA)
Ming LI created HAWQ-1338:
-

 Summary: In some case writer process crashed when running 'hawq 
stop cluster'
 Key: HAWQ-1338
 URL: https://issues.apache.org/jira/browse/HAWQ-1338
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Core
Reporter: Ming LI
Assignee: Ed Espino


On the master node of the test machine, some processes do not exit cleanly and dump 
core after a while.

{code}
--- The running log  -
2/12/17 11:33:59 PM PST: 
--
2/12/17 11:33:59 PM PST: Check if postgres/java processes are closed properly:
2/12/17 11:33:59 PM PST: 
--
2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test1: 
2/12/17 11:33:59 PM PST: gpadmin5279  1  0 22:53 ?00:00:03 
postgres: port 31000, master logger process 
  
2/12/17 11:33:59 PM PST: gpadmin5283  1  0 22:53 ?00:00:01 
postgres: port 31000, writer process
  
2/12/17 11:33:59 PM PST: root  23864 24  1 23:37 ?00:00:01 
/usr/libexec/abrt-hook-ccpp 6 18446744073709551615 5283 501 501 1486971433 
postgres
2/12/17 11:33:59 PM PST: -
2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test2: 
2/12/17 11:33:59 PM PST: -
2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test3: 
2/12/17 11:33:59 PM PST: -
2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test4: 
2/12/17 11:33:59 PM PST: -
2/12/17 11:33:59 PM PST: Check if postgres|java process is running on test5: 
2/12/17 11:33:59 PM PST: -
2/12/17 11:33:59 PM PST: ERROR: Postgres process not closed on test1, please 
check.
2/12/17 11:33:59 PM PST: 
--
--- The call stack -
(gdb) bt
#0  0x0032214325e5 in raise () from /lib64/libc.so.6
#1  0x003221433dc5 in abort () from /lib64/libc.so.6
#2  0x0096433a in errfinish (dummy=0) at elog.c:686
#3  0x009665bd in elog_finish (elevel=22, fmt=0xc53af0 "process is 
dying from critical section") at elog.c:1463
#4  0x0086c11d in proc_exit_prepare (code=1) at ipc.c:153
#5  0x0086c0a9 in proc_exit (code=1) at ipc.c:93
#6  0x00964300 in errfinish (dummy=0) at elog.c:670
#7  0x00825121 in ServiceClientRead (serviceClient=0xfc73f0, 
response=0x7fffb96842de, responseLen=1,
timeout=0x7fffb96842c0) at service.c:523
#8  0x00824f7b in ServiceClientReceiveResponse (serviceClient=0xfc73f0, 
response=0x7fffb96842de, responseLen=1,
timeout=0x7fffb96842c0) at service.c:480
#9  0x0082bce1 in WalSendServerClientReceiveResponse 
(walSendResponse=0x7fffb96842de, timeout=0x7fffb96842c0)
at walsendserver.c:372
#10 0x0051596d in XLogQDMirrorWaitForResponse (waitForever=0 '\000') at 
xlog.c:1919
#11 0x00515c0c in XLogQDMirrorWrite (startidx=0, npages=1, 
timeLineID=1, logId=0, logSeg=1, logOff=13729792)
at xlog.c:2005
#12 0x00516615 in XLogWrite (WriteRqst=..., flexible=0 '\000', 
xlog_switch=0 '\000') at xlog.c:2354
#13 0x00516d68 in XLogFlush (record=...) at xlog.c:2572
#14 0x00522f88 in CreateCheckPoint (shutdown=1 '\001', force=1 '\001') 
at xlog.c:8136
#15 0x0052277b in ShutdownXLOG (code=0, arg=0) at xlog.c:7865
#16 0x00821f42 in BackgroundWriterMain () at bgwriter.c:318
#17 0x0059c9f1 in AuxiliaryProcessMain (argc=2, argv=0x7fffb9684b60) at 
bootstrap.c:467
#18 0x0083c7b0 in StartChildProcess (type=BgWriterProcess) at 
postmaster.c:6836
#19 0x00838f39 in CommenceNormalOperations () at postmaster.c:3618
#20 0x0083984a in do_reaper () at postmaster.c:4021
#21 0x00835e97 in ServerLoop () at postmaster.c:2136
#22 0x0083500f in PostmasterMain (argc=9, argv=0x288bd10) at 
postmaster.c:1454
#23 0x007612af in main (argc=9, argv=0x288bd10) at main.c:226
{code}






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (HAWQ-1324) Query cancel cause segment to go into Crash recovery

2017-02-13 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1324.
---
   Resolution: Fixed
Fix Version/s: backlog

> Query cancel cause segment to go into Crash recovery
> 
>
> Key: HAWQ-1324
> URL: https://issues.apache.org/jira/browse/HAWQ-1324
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> A query was cancelled due to this connection issue to HDFS on Isilon. Seg26 
> then went into crash recovery due to an INSERT query being cancelled. What 
> should the expected behaviour be when HDFS becomes unavailable and a query 
> fails due to HDFS unavailability?
> Below is the HDFS error
> {code}
> 2017-01-04 03:04:08.382615 
> JST,"carund","dwhrun",p574246,th1862944896,"192.168.10.12","47554",2017-01-04 
> 03:03:08 JST,0,con198952,,seg29,"FATAL","08006","connection to client 
> lost",,,0,,"postgres.c",3518,
> 2017-01-04 03:04:08.420099 
> JST,,,p755778,th18629448960,,,seg-1,"LOG","0","3rd party 
> error log:
> 2017-01-04 03:04:08.419969, p574222, th140507423066240, ERROR Handle 
> Exception: NamenodeImpl.cpp: 670: Unexpected error: status: 
> STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with path=
> ""/hawq_default/16385/16563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240
> @ Hdfs::Internal::UnWrapper Hdfs::HdfsIOException, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing , 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::UnWrapper Hdfs::UnresolvedLinkException, Hdfs::HdfsIOException, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Not hing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::NamenodeImpl::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::NamenodeProxy::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::OutputStreamImpl::closePipeline()
> @ Hdfs::Internal::OutputStreamImpl::close()
> @ hdfsCloseFile
> @ gpfs_hdfs_closefile
> @ HdfsCloseFile
> @ HdfsFileClose
> @ CleanupTempFiles
> @ AbortTransaction
> @ AbortCurrentTransaction
> @ PostgresMain
> @ BackendStartup
> @ ServerLoop
> @ PostmasterMain
> @ main
> @ Unknown
> @ Unknown""SysLoggerMain","syslogger.c",518,
> 2017-01-04 03:04:08.420272 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"WARNING","58030","could 
> not close file 7 : (hdfs://ffd
> lakehd.ffwin.fujifilm.co.jp:8020/hawq_default/16385/16563/802748/26) errno 
> 5","Unexpected error: status: STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with path=""/hawq_default/16385/16
> 563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240",,0,,"fd.c",2762,
> {code}
> Segment 26 going into Crash recovery - from seg26 log file
> {code}
> 2017-01-04 03:04:08.420314 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"LOG","08006","could not 
> send data to client: 接続が相
> 手からリセットされました",,,0,,"pqcomm.c",1292,
> 2017-01-04 03:04:08.420358 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 JST,0,con198952,,seg25,"LOG","08006","could not send data to 
> client: パイプが切断されました",,,0,
> ,"pqcomm.c",1292,
> 2017-01-04 03:04:08.420375 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 JST,0,con198952,,seg25,"FATAL","08006","connection to client 
> lost",,,0,,"postgres.c",3518,
> 2017-01-04 03:04:08.950354 
> JST,,,p755773,th18629448960,,,seg-1,"LOG","0","server process 
> (PID 574240) was terminated by signal 11: Segmentation 
> fault",,,0,,"postmaster.c",4748,
> 2017-01-04 03:04:08.950403 
> JST,,,p755773,th18629448960,,,seg-1,"LOG","0","terminating 
> any other active server processes",,,0,,"postmaster.c",4486,
> 2017-01-04 03:04:08.954044 
> JST,,,p41605,th18629448960,,,seg-1,"LOG","0","Segment RM 
> exits.",,,0,,"resourcemanager.c",340,
> 2017-01-04 03:04:08.954078 
> JST,,,p41605,th18629448960,,,seg-1,"LOG","0","Clean up 
> handler in message server is called.",,,0,,"rmcomm_MessageServer.c",105,
> 

[jira] [Created] (HAWQ-1324) Query cancel cause segment to go into Crash recovery

2017-02-13 Thread Ming LI (JIRA)
Ming LI created HAWQ-1324:
-

 Summary: Query cancel cause segment to go into Crash recovery
 Key: HAWQ-1324
 URL: https://issues.apache.org/jira/browse/HAWQ-1324
 Project: Apache HAWQ
  Issue Type: Bug
Reporter: Ming LI
Assignee: Ed Espino


A query was cancelled due to this connection issue to HDFS on Isilon. Seg26 
then went into crash recovery due to an INSERT query being cancelled. What 
should the expected behaviour be when HDFS becomes unavailable and a query 
fails due to HDFS unavailability?
Below is the HDFS error
{code}
2017-01-04 03:04:08.382615 
JST,"carund","dwhrun",p574246,th1862944896,"192.168.10.12","47554",2017-01-04 
03:03:08 JST,0,con198952,,seg29,"FATAL","08006","connection to client 
lost",,,0,,"postgres.c",3518,
2017-01-04 03:04:08.420099 
JST,,,p755778,th18629448960,,,seg-1,"LOG","0","3rd party error 
log:
2017-01-04 03:04:08.419969, p574222, th140507423066240, ERROR Handle Exception: 
NamenodeImpl.cpp: 670: Unexpected error: status: STATUS_FILE_NOT_AVAILABLE = 
0xC467 Path: hawq_default/16385/16563/802748/26 with path=
""/hawq_default/16385/16563/802748/26"", 
clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240
@ Hdfs::Internal::UnWrapper::unwrap(char const, int)
@ Hdfs::Internal::UnWrapper::unwrap(char const, int)
@ Hdfs::Internal::NamenodeImpl::fsync(std::string const&, std::string const&)
@ Hdfs::Internal::NamenodeProxy::fsync(std::string const&, std::string const&)
@ Hdfs::Internal::OutputStreamImpl::closePipeline()
@ Hdfs::Internal::OutputStreamImpl::close()
@ hdfsCloseFile
@ gpfs_hdfs_closefile
@ HdfsCloseFile
@ HdfsFileClose
@ CleanupTempFiles
@ AbortTransaction
@ AbortCurrentTransaction
@ PostgresMain
@ BackendStartup
@ ServerLoop
@ PostmasterMain
@ main
@ Unknown
@ Unknown""SysLoggerMain","syslogger.c",518,
2017-01-04 03:04:08.420272 
JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
03:03:08 
JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"WARNING","58030","could not 
close file 7 : (hdfs://ffd
lakehd.ffwin.fujifilm.co.jp:8020/hawq_default/16385/16563/802748/26) errno 
5","Unexpected error: status: STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
hawq_default/16385/16563/802748/26 with path=""/hawq_default/16385/16
563/802748/26"", 
clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240",,0,,"fd.c",2762,
{code}
Segment 26 going into Crash recovery - from seg26 log file
{code}
2017-01-04 03:04:08.420314 
JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
03:03:08 JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"LOG","08006","could 
not send data to client: 接続が相
手からリセットされました",,,0,,"pqcomm.c",1292,
2017-01-04 03:04:08.420358 
JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
03:03:08 JST,0,con198952,,seg25,"LOG","08006","could not send data to 
client: パイプが切断されました",,,0,
,"pqcomm.c",1292,
2017-01-04 03:04:08.420375 
JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
03:03:08 JST,0,con198952,,seg25,"FATAL","08006","connection to client 
lost",,,0,,"postgres.c",3518,
2017-01-04 03:04:08.950354 
JST,,,p755773,th18629448960,,,seg-1,"LOG","0","server process 
(PID 574240) was terminated by signal 11: Segmentation 
fault",,,0,,"postmaster.c",4748,
2017-01-04 03:04:08.950403 
JST,,,p755773,th18629448960,,,seg-1,"LOG","0","terminating any 
other active server processes",,,0,,"postmaster.c",4486,
2017-01-04 03:04:08.954044 
JST,,,p41605,th18629448960,,,seg-1,"LOG","0","Segment RM 
exits.",,,0,,"resourcemanager.c",340,
2017-01-04 03:04:08.954078 
JST,,,p41605,th18629448960,,,seg-1,"LOG","0","Clean up handler 
in message server is called.",,,0,,"rmcomm_MessageServer.c",105,
2017-01-04 03:04:08.972706 
JST,,,p574711,th1862944896,"192.168.10.12","48121",2017-01-04 03:04:08 
JST,0,,,seg-1,"LOG","0","PID 574308 in cancel request did not match 
any process",,,0,,"postmaster.c",3166
,
2017-01-04 03:04:08.976211 
JST,,,p574712,th1862944896,"192.168.10.12","48127",2017-01-04 03:04:08 
JST,0,,,seg-1,"LOG","0","PID 574320 in cancel request did not match 
any 

[jira] [Commented] (HAWQ-1324) Query cancel cause segment to go into Crash recovery

2017-02-13 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863267#comment-15863267
 ] 

Ming LI commented on HAWQ-1324:
---

Hi all,

The root cause is that backtrace() is not a safe function to call from a signal 
handler. 

It is similar to the problem described below:
http://stackoverflow.com/questions/6371028/what-makes-backtrace-crashsigsegv-on-linux-64-bit
{code}
The documentation for signal handling 
(http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html) 
defines the list of safe functions to call from a signal handler; you must not 
use any other functions, including backtrace. (Search for async-signal-safe in 
that document.)
{code}

So the fix should be similar to 
https://issues.apache.org/jira/browse/HAWQ-978. 

I need more time to verify the fix. Thanks.
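
For illustration only (this is not the actual HAWQ-978 patch), two commonly used 
mitigations are to force backtrace()'s lazy initialization once during normal 
startup, and to use only async-signal-safe calls such as backtrace_symbols_fd() 
inside the handler:
{code}
/* Illustrative sketch, not the HAWQ source.  backtrace_symbols_fd() writes
 * directly to a file descriptor and does not call malloc(); calling
 * backtrace() once at startup makes libgcc's lazy initialization happen
 * outside the signal handler. */
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

#define MAX_FRAMES 64

static void
crash_handler(int signo)
{
    void *frames[MAX_FRAMES];
    int   nframes = backtrace(frames, MAX_FRAMES);

    backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
    _exit(128 + signo);              /* _exit() is async-signal-safe, exit() is not */
}

static void
install_crash_handler(void)
{
    void *warmup[1];

    (void) backtrace(warmup, 1);     /* force lazy libgcc load before any signal */
    signal(SIGSEGV, crash_handler);
}
{code}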

> Query cancel cause segment to go into Crash recovery
> 
>
> Key: HAWQ-1324
> URL: https://issues.apache.org/jira/browse/HAWQ-1324
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Ming LI
>Assignee: Ed Espino
>
> A query was cancelled due to this connection issue to HDFS on Isilon. Seg26 
> then went into crash recovery due to an INSERT query being cancelled. What 
> should the expected behaviour be when HDFS becomes unavailable and a query 
> fails due to HDFS unavailability?
> Below is the HDFS error
> {code}
> 2017-01-04 03:04:08.382615 
> JST,"carund","dwhrun",p574246,th1862944896,"192.168.10.12","47554",2017-01-04 
> 03:03:08 JST,0,con198952,,seg29,"FATAL","08006","connection to client 
> lost",,,0,,"postgres.c",3518,
> 2017-01-04 03:04:08.420099 
> JST,,,p755778,th18629448960,,,seg-1,"LOG","0","3rd party 
> error log:
> 2017-01-04 03:04:08.419969, p574222, th140507423066240, ERROR Handle 
> Exception: NamenodeImpl.cpp: 670: Unexpected error: status: 
> STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with path=
> ""/hawq_default/16385/16563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240
> @ Hdfs::Internal::UnWrapper Hdfs::HdfsIOException, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing , 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::UnWrapper Hdfs::UnresolvedLinkException, Hdfs::HdfsIOException, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Not hing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::NamenodeImpl::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::NamenodeProxy::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::OutputStreamImpl::closePipeline()
> @ Hdfs::Internal::OutputStreamImpl::close()
> @ hdfsCloseFile
> @ gpfs_hdfs_closefile
> @ HdfsCloseFile
> @ HdfsFileClose
> @ CleanupTempFiles
> @ AbortTransaction
> @ AbortCurrentTransaction
> @ PostgresMain
> @ BackendStartup
> @ ServerLoop
> @ PostmasterMain
> @ main
> @ Unknown
> @ Unknown""SysLoggerMain","syslogger.c",518,
> 2017-01-04 03:04:08.420272 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"WARNING","58030","could 
> not close file 7 : (hdfs://ffd
> lakehd.ffwin.fujifilm.co.jp:8020/hawq_default/16385/16563/802748/26) errno 
> 5","Unexpected error: status: STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with path=""/hawq_default/16385/16
> 563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240",,0,,"fd.c",2762,
> {code}
> Segment 26 going into Crash recovery - from seg26 log file
> {code}
> 2017-01-04 03:04:08.420314 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"LOG","08006","could not 
> send data to client: 接続が相
> 手からリセットされました",,,0,,"pqcomm.c",1292,
> 2017-01-04 03:04:08.420358 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 JST,0,con198952,,seg25,"LOG","08006","could not send data to 
> client: パイプが切断されました",,,0,
> ,"pqcomm.c",1292,
> 2017-01-04 03:04:08.420375 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 JST,0,con198952,,seg25,"FATAL","08006","connection to client 
> lost",,,0,,"postgres.c",3518,
> 2017-01-04 03:04:08.950354 
> JST,,,p755773,th18629448960,,,seg-1,"LOG","0","server process 

[jira] [Assigned] (HAWQ-1324) Query cancel cause segment to go into Crash recovery

2017-02-13 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1324:
-

Assignee: Ming LI  (was: Ed Espino)

> Query cancel cause segment to go into Crash recovery
> 
>
> Key: HAWQ-1324
> URL: https://issues.apache.org/jira/browse/HAWQ-1324
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Ming LI
>Assignee: Ming LI
>
> A query was cancelled due to a connection issue to HDFS on Isilon. Seg26 
> then went into crash recovery because an INSERT query was cancelled. What 
> should the expected behaviour be when HDFS becomes unavailable and a query 
> fails due to HDFS unavailability?
> Below is the HDFS error
> {code}
> 2017-01-04 03:04:08.382615 
> JST,"carund","dwhrun",p574246,th1862944896,"192.168.10.12","47554",2017-01-04 
> 03:03:08 JST,0,con198952,,seg29,"FATAL","08006","connection to client 
> lost",,,0,,"postgres.c",3518,
> 2017-01-04 03:04:08.420099 
> JST,,,p755778,th18629448960,,,seg-1,"LOG","0","3rd party 
> error log:
> 2017-01-04 03:04:08.419969, p574222, th140507423066240, ERROR Handle 
> Exception: NamenodeImpl.cpp: 670: Unexpected error: status: 
> STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with path=
> ""/hawq_default/16385/16563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240
> @ Hdfs::Internal::UnWrapper Hdfs::HdfsIOException, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing , 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::UnWrapper Hdfs::UnresolvedLinkException, Hdfs::HdfsIOException, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Not hing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::NamenodeImpl::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::NamenodeProxy::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::OutputStreamImpl::closePipeline()
> @ Hdfs::Internal::OutputStreamImpl::close()
> @ hdfsCloseFile
> @ gpfs_hdfs_closefile
> @ HdfsCloseFile
> @ HdfsFileClose
> @ CleanupTempFiles
> @ AbortTransaction
> @ AbortCurrentTransaction
> @ PostgresMain
> @ BackendStartup
> @ ServerLoop
> @ PostmasterMain
> @ main
> @ Unknown
> @ Unknown""SysLoggerMain","syslogger.c",518,
> 2017-01-04 03:04:08.420272 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"WARNING","58030","could 
> not close file 7 : (hdfs://ffd
> lakehd.ffwin.fujifilm.co.jp:8020/hawq_default/16385/16563/802748/26) errno 
> 5","Unexpected error: status: STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with path=""/hawq_default/16385/16
> 563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240",,0,,"fd.c",2762,
> {code}
> Segment 26 going into Crash recovery - from seg26 log file
> {code}
> 2017-01-04 03:04:08.420314 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"LOG","08006","could not 
> send data to client: 接続が相
> 手からリセットされました",,,0,,"pqcomm.c",1292,
> 2017-01-04 03:04:08.420358 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 JST,0,con198952,,seg25,"LOG","08006","could not send data to 
> client: パイプが切断されました",,,0,
> ,"pqcomm.c",1292,
> 2017-01-04 03:04:08.420375 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 JST,0,con198952,,seg25,"FATAL","08006","connection to client 
> lost",,,0,,"postgres.c",3518,
> 2017-01-04 03:04:08.950354 
> JST,,,p755773,th18629448960,,,seg-1,"LOG","0","server process 
> (PID 574240) was terminated by signal 11: Segmentation 
> fault",,,0,,"postmaster.c",4748,
> 2017-01-04 03:04:08.950403 
> JST,,,p755773,th18629448960,,,seg-1,"LOG","0","terminating 
> any other active server processes",,,0,,"postmaster.c",4486,
> 2017-01-04 03:04:08.954044 
> JST,,,p41605,th18629448960,,,seg-1,"LOG","0","Segment RM 
> exits.",,,0,,"resourcemanager.c",340,
> 2017-01-04 03:04:08.954078 
> JST,,,p41605,th18629448960,,,seg-1,"LOG","0","Clean up 
> handler in message server is called.",,,0,,"rmcomm_MessageServer.c",105,
> 2017-01-04 03:04:08.972706 
> 

[jira] [Comment Edited] (HAWQ-1324) Query cancel cause segment to go into Crash recovery

2017-02-13 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863274#comment-15863274
 ] 

Ming LI edited comment on HAWQ-1324 at 2/13/17 7:14 AM:


The complete fix for this defect should be similar to PostgreSQL 9.6:
1) StatementCancelHandler() must not call ProcessInterrupts() directly, because 
that cascades into calls to functions that are not async-signal safe. Only 
simple logic (e.g. setting flag variables) should run inside the signal handler; 
ProcessInterrupts() should then be called at the points in the code where it is 
safe to act on the signal.
2) Forward-porting all related fixes from PostgreSQL to HAWQ would be a very 
complex task. For now we offer a fix that does not completely eliminate this 
kind of crash, but it reduces its likelihood.

Thanks.


was (Author: mli):
The complete fix for this defect should be similar to PostgreSQL 9.6:
1) StatementCancelHandler() must not call ProcessInterrupts() directly, because 
that cascades into calls to functions that are not async-signal safe. Only 
simple logic (e.g. setting flag variables) should run inside the signal handler; 
ProcessInterrupts() should then be called at the points in the code where it is 
safe to act on the signal.
2) Forward-porting all related fixes from PostgreSQL to HAWQ would be a very 
complex task. For now we offer a fix that does not completely eliminate this 
kind of crash, but it reduces its likelihood.

Thanks.

> Query cancel cause segment to go into Crash recovery
> 
>
> Key: HAWQ-1324
> URL: https://issues.apache.org/jira/browse/HAWQ-1324
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Ming LI
>Assignee: Ming LI
>
> A query was cancelled due to a connection issue to HDFS on Isilon. Seg26 
> then went into crash recovery because an INSERT query was cancelled. What 
> should the expected behaviour be when HDFS becomes unavailable and a query 
> fails due to HDFS unavailability?
> Below is the HDFS error
> {code}
> 2017-01-04 03:04:08.382615 
> JST,"carund","dwhrun",p574246,th1862944896,"192.168.10.12","47554",2017-01-04 
> 03:03:08 JST,0,con198952,,seg29,"FATAL","08006","connection to client 
> lost",,,0,,"postgres.c",3518,
> 2017-01-04 03:04:08.420099 
> JST,,,p755778,th18629448960,,,seg-1,"LOG","0","3rd party 
> error log:
> 2017-01-04 03:04:08.419969, p574222, th140507423066240, ERROR Handle 
> Exception: NamenodeImpl.cpp: 670: Unexpected error: status: 
> STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with path=
> ""/hawq_default/16385/16563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240
> @ Hdfs::Internal::UnWrapper Hdfs::HdfsIOException, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing , 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::UnWrapper Hdfs::UnresolvedLinkException, Hdfs::HdfsIOException, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Not hing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::NamenodeImpl::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::NamenodeProxy::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::OutputStreamImpl::closePipeline()
> @ Hdfs::Internal::OutputStreamImpl::close()
> @ hdfsCloseFile
> @ gpfs_hdfs_closefile
> @ HdfsCloseFile
> @ HdfsFileClose
> @ CleanupTempFiles
> @ AbortTransaction
> @ AbortCurrentTransaction
> @ PostgresMain
> @ BackendStartup
> @ ServerLoop
> @ PostmasterMain
> @ main
> @ Unknown
> @ Unknown""SysLoggerMain","syslogger.c",518,
> 2017-01-04 03:04:08.420272 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"WARNING","58030","could 
> not close file 7 : (hdfs://ffd
> lakehd.ffwin.fujifilm.co.jp:8020/hawq_default/16385/16563/802748/26) errno 
> 5","Unexpected error: status: STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with path=""/hawq_default/16385/16
> 563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240",,0,,"fd.c",2762,
> {code}
> Segment 26 going into Crash recovery - from seg26 log file
> {code}
> 2017-01-04 03:04:08.420314 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"LOG","08006","could not 
> send 

[jira] [Commented] (HAWQ-1324) Query cancel cause segment to go into Crash recovery

2017-02-13 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863274#comment-15863274
 ] 

Ming LI commented on HAWQ-1324:
---

The complete fix for this defect should be similar to PostgreSQL 9.6:
1) StatementCancelHandler() must not call ProcessInterrupts() directly, because 
that cascades into calls to functions that are not async-signal safe. Only 
simple logic (e.g. setting flag variables) should run inside the signal handler; 
ProcessInterrupts() should then be called at the points in the code where it is 
safe to act on the signal.
2) Forward-porting all related fixes from PostgreSQL to HAWQ would be a very 
complex task. For now we offer a fix that does not completely eliminate this 
kind of crash, but it reduces its likelihood.

Thanks.
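
For illustration only, here is a minimal, self-contained C sketch of the 
flag-setting pattern described in point 1 (hypothetical code, not the actual 
HAWQ/PostgreSQL patch): the signal handler only records that a cancel request 
arrived, and the real interrupt processing runs later at an explicit check 
point in the main loop, where it is safe.
{code}
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t cancel_pending = 0;

/* Async-signal-safe: the handler does nothing but set a flag. */
static void statement_cancel_handler(int signo)
{
    (void) signo;
    cancel_pending = 1;
}

/* Called only from the main loop, where heavier work is safe. */
static int process_interrupts_if_pending(void)
{
    if (!cancel_pending)
        return 0;
    cancel_pending = 0;
    printf("cancel request handled at a safe point\n");
    return 1;
}

int main(void)
{
    signal(SIGINT, statement_cancel_handler);
    printf("working; press Ctrl-C to request cancellation\n");
    for (;;)
    {
        sleep(1);                              /* one bounded unit of work */
        if (process_interrupts_if_pending())   /* explicit check point */
            break;
    }
    return 0;
}
{code}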

> Query cancel cause segment to go into Crash recovery
> 
>
> Key: HAWQ-1324
> URL: https://issues.apache.org/jira/browse/HAWQ-1324
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Ming LI
>Assignee: Ed Espino
>
> A query was cancelled due to a connection issue to HDFS on Isilon. Seg26 
> then went into crash recovery because an INSERT query was cancelled. What 
> should the expected behaviour be when HDFS becomes unavailable and a query 
> fails due to HDFS unavailability?
> Below is the HDFS error
> {code}
> 2017-01-04 03:04:08.382615 
> JST,"carund","dwhrun",p574246,th1862944896,"192.168.10.12","47554",2017-01-04 
> 03:03:08 JST,0,con198952,,seg29,"FATAL","08006","connection to client 
> lost",,,0,,"postgres.c",3518,
> 2017-01-04 03:04:08.420099 
> JST,,,p755778,th18629448960,,,seg-1,"LOG","0","3rd party 
> error log:
> 2017-01-04 03:04:08.419969, p574222, th140507423066240, ERROR Handle 
> Exception: NamenodeImpl.cpp: 670: Unexpected error: status: 
> STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with path=
> ""/hawq_default/16385/16563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240
> @ Hdfs::Internal::UnWrapper Hdfs::HdfsIOException, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing , 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::UnWrapper Hdfs::UnresolvedLinkException, Hdfs::HdfsIOException, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Not hing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::NamenodeImpl::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::NamenodeProxy::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::OutputStreamImpl::closePipeline()
> @ Hdfs::Internal::OutputStreamImpl::close()
> @ hdfsCloseFile
> @ gpfs_hdfs_closefile
> @ HdfsCloseFile
> @ HdfsFileClose
> @ CleanupTempFiles
> @ AbortTransaction
> @ AbortCurrentTransaction
> @ PostgresMain
> @ BackendStartup
> @ ServerLoop
> @ PostmasterMain
> @ main
> @ Unknown
> @ Unknown""SysLoggerMain","syslogger.c",518,
> 2017-01-04 03:04:08.420272 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"WARNING","58030","could 
> not close file 7 : (hdfs://ffd
> lakehd.ffwin.fujifilm.co.jp:8020/hawq_default/16385/16563/802748/26) errno 
> 5","Unexpected error: status: STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with path=""/hawq_default/16385/16
> 563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240",,0,,"fd.c",2762,
> {code}
> Segment 26 going into Crash recovery - from seg26 log file
> {code}
> 2017-01-04 03:04:08.420314 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"LOG","08006","could not 
> send data to client: 接続が相
> 手からリセットされました",,,0,,"pqcomm.c",1292,
> 2017-01-04 03:04:08.420358 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 JST,0,con198952,,seg25,"LOG","08006","could not send data to 
> client: パイプが切断されました",,,0,
> ,"pqcomm.c",1292,
> 2017-01-04 03:04:08.420375 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 JST,0,con198952,,seg25,"FATAL","08006","connection to client 
> lost",,,0,,"postgres.c",3518,
> 2017-01-04 03:04:08.950354 
> JST,,,p755773,th18629448960,,,seg-1,"LOG","0","server process 
> (PID 574240) was terminated by signal 11: 

[jira] [Commented] (HAWQ-1278) Investigate installcheck-good issue on Mac OSX

2017-01-18 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827658#comment-15827658
 ] 

Ming LI commented on HAWQ-1278:
---

On macOS 10.12.2, after changing the curl version from 7.51.0 to 7.43.0, the 
errortbl test ran clean.
So it appears to be a curl bug in that version on this platform. Thanks.

> Investigate installcheck-good issue on Mac OSX
> --
>
> Key: HAWQ-1278
> URL: https://issues.apache.org/jira/browse/HAWQ-1278
> Project: Apache HAWQ
>  Issue Type: Task
>  Components: Tests
>Reporter: Ed Espino
>Assignee: Ruilong Huo
> Fix For: 2.2.0.0-incubating
>
>
> I am filing this as a place holder for the Mac OSX installcheck-good 
> investigation work.  Ming Li originally reported installcheck-good testing 
> issues with errtable and hcatalog_lookup test suites.
> This issue is not seen on CentOS 6 & 7 environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HAWQ-398) Port Interconnect Hanging Bug Fix from GPDB to HAWQ

2016-12-24 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15774669#comment-15774669
 ] 

Ming LI commented on HAWQ-398:
--

This fix will be merged via HAWQ-1208, which ports many important interconnect 
fixes from GPDB to HAWQ all at once.

> Port Interconnect Hanging Bug Fix from GPDB to HAWQ
> ---
>
> Key: HAWQ-398
> URL: https://issues.apache.org/jira/browse/HAWQ-398
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Interconnect
>Reporter: Lirong Jian
>Assignee: Lei Chang
> Fix For: backlog
>
>
> There is a known issue with UDP Interconnect, which has been fixed in GPDB: 
> https://github.com/greenplum-db/gpdb/pull/336/. We should port the fix to 
> HAWQ.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HAWQ-1195) Synchrony:Union not working on external tables ERROR:"Two or more external tables use the same error table ""xxxxxxx"" in a statement (execMain.c:274)"

2016-12-15 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1195.
---
Resolution: Fixed

> Synchrony:Union not working on external tables ERROR:"Two or more external 
> tables use the same error table ""xxx"" in a statement (execMain.c:274)"
> ---
>
> Key: HAWQ-1195
> URL: https://issues.apache.org/jira/browse/HAWQ-1195
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: External Tables
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> Hello,
> The user creates an external table and defines an error table, then runs a 
> UNION over the same external table with different WHERE conditions. The query 
> returns the error: ERROR:  Two or more external tables use the same error table 
> "err_ext_pdr_cdci_pivotal_request_43448" in a statement (execMain.c:274)
> Below is the master log from my reproduction (the whole log file is attached):
> {code}
> 2016-11-29 22:49:51.976864 
> PST,"gpadmin","postgres",p769199,th-2123704032,"[local]",,2016-11-29 22:46:14 
> PST,1260,con72,cmd10,seg-1,,,x1260,sx1,"ERROR","XX000","Two or more external 
> tables use the same error table ""err_ext_pdr_cdci_pivotal_request_43448"" in 
> a statement (execMain.c:274)",,"select current_account_nbr,yearmonthint, 
> bank_name, first_date_open, max_cr_limit, care_credit_flag, cc1_flag, 
> partition_value, 'US' as loc from pdr_cdci_pivotal_request_43448 where 
> care_credit_flag<1
> union
> select current_account_nbr,yearmonthint, bank_name, first_date_open, 
> max_cr_limit, care_credit_flag, cc1_flag, partition_value, 'Non-US' as loc 
> from pdr_cdci_pivotal_request_43448 where 
> care_credit_flag=1;",0,,"execMain.c",274,"Stack trace:
> 10x8c5858 postgres errstart (??:0)
> 20x8c75db postgres elog_finish (??:0)
> 30x65f669 postgres  (??:0)
> 40x77d06a postgres walk_plan_node_fields (??:0)
> 50x77e3ee postgres plan_tree_walker (??:0)
> 60x77c70a postgres expression_tree_walker (??:0)
> 70x77e35d postgres plan_tree_walker (??:0)
> 80x77d06a postgres walk_plan_node_fields (??:0)
> 90x77dfe6 postgres plan_tree_walker (??:0)
> 10   0x77d06a postgres walk_plan_node_fields (??:0)
> 11   0x77e1e5 postgres plan_tree_walker (??:0)
> 12   0x77d06a postgres walk_plan_node_fields (??:0)
> 13   0x77dfe6 postgres plan_tree_walker (??:0)
> 14   0x77d06a postgres walk_plan_node_fields (??:0)
> 15   0x77e1e5 postgres plan_tree_walker (??:0)
> 16   0x66079b postgres ExecutorStart (??:0)
> 17   0x7ebf1d postgres PortalStart (??:0)
> 18   0x7e4288 postgres  (??:0)
> 19   0x7e54c2 postgres PostgresMain (??:0)
> 20   0x797d50 postgres  (??:0)
> 21   0x79ab19 postgres PostmasterMain (??:0)
> 22   0x4a4069 postgres main (??:0)
> 23   0x7fd97d486d5d libc.so.6 __libc_start_main (??:0)
> 24   0x4a40e9 postgres  (??:0)
> "
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HAWQ-1195) Synchrony:Union not working on external tables ERROR:"Two or more external tables use the same error table ""xxxxxxx"" in a statement (execMain.c:274)"

2016-12-12 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reopened HAWQ-1195:
---

Reverting this commit because installcheck-good failed.
It is my fault: I had thought that we had moved all the test cases from 
installcheck-good to the feature test, so I only ran the feature test.

Need more time to investigate why it conflicts on the same relation file number.

> Synchrony:Union not working on external tables ERROR:"Two or more external 
> tables use the same error table ""xxx"" in a statement (execMain.c:274)"
> ---
>
> Key: HAWQ-1195
> URL: https://issues.apache.org/jira/browse/HAWQ-1195
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: External Tables
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> Hello,
> The user creates an external table and defines an error table, then runs a 
> UNION over the same external table with different WHERE conditions. The query 
> returns the error: ERROR:  Two or more external tables use the same error table 
> "err_ext_pdr_cdci_pivotal_request_43448" in a statement (execMain.c:274)
> Below is the master log from my reproduction (the whole log file is attached):
> {code}
> 2016-11-29 22:49:51.976864 
> PST,"gpadmin","postgres",p769199,th-2123704032,"[local]",,2016-11-29 22:46:14 
> PST,1260,con72,cmd10,seg-1,,,x1260,sx1,"ERROR","XX000","Two or more external 
> tables use the same error table ""err_ext_pdr_cdci_pivotal_request_43448"" in 
> a statement (execMain.c:274)",,"select current_account_nbr,yearmonthint, 
> bank_name, first_date_open, max_cr_limit, care_credit_flag, cc1_flag, 
> partition_value, 'US' as loc from pdr_cdci_pivotal_request_43448 where 
> care_credit_flag<1
> union
> select current_account_nbr,yearmonthint, bank_name, first_date_open, 
> max_cr_limit, care_credit_flag, cc1_flag, partition_value, 'Non-US' as loc 
> from pdr_cdci_pivotal_request_43448 where 
> care_credit_flag=1;",0,,"execMain.c",274,"Stack trace:
> 10x8c5858 postgres errstart (??:0)
> 20x8c75db postgres elog_finish (??:0)
> 30x65f669 postgres  (??:0)
> 40x77d06a postgres walk_plan_node_fields (??:0)
> 50x77e3ee postgres plan_tree_walker (??:0)
> 60x77c70a postgres expression_tree_walker (??:0)
> 70x77e35d postgres plan_tree_walker (??:0)
> 80x77d06a postgres walk_plan_node_fields (??:0)
> 90x77dfe6 postgres plan_tree_walker (??:0)
> 10   0x77d06a postgres walk_plan_node_fields (??:0)
> 11   0x77e1e5 postgres plan_tree_walker (??:0)
> 12   0x77d06a postgres walk_plan_node_fields (??:0)
> 13   0x77dfe6 postgres plan_tree_walker (??:0)
> 14   0x77d06a postgres walk_plan_node_fields (??:0)
> 15   0x77e1e5 postgres plan_tree_walker (??:0)
> 16   0x66079b postgres ExecutorStart (??:0)
> 17   0x7ebf1d postgres PortalStart (??:0)
> 18   0x7e4288 postgres  (??:0)
> 19   0x7e54c2 postgres PostgresMain (??:0)
> 20   0x797d50 postgres  (??:0)
> 21   0x79ab19 postgres PostmasterMain (??:0)
> 22   0x4a4069 postgres main (??:0)
> 23   0x7fd97d486d5d libc.so.6 __libc_start_main (??:0)
> 24   0x4a40e9 postgres  (??:0)
> "
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HAWQ-1195) Synchrony:Union not working on external tables ERROR:"Two or more external tables use the same error table ""xxxxxxx"" in a statement (execMain.c:274)"

2016-12-11 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1195.
---
   Resolution: Fixed
Fix Version/s: backlog

> Synchrony:Union not working on external tables ERROR:"Two or more external 
> tables use the same error table ""xxx"" in a statement (execMain.c:274)"
> ---
>
> Key: HAWQ-1195
> URL: https://issues.apache.org/jira/browse/HAWQ-1195
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: External Tables
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> Hello,
> The user creates an external table and defines an error table, then runs a 
> UNION over the same external table with different WHERE conditions. The query 
> returns the error: ERROR:  Two or more external tables use the same error table 
> "err_ext_pdr_cdci_pivotal_request_43448" in a statement (execMain.c:274)
> Below is the master log from my reproduction (the whole log file is attached):
> {code}
> 2016-11-29 22:49:51.976864 
> PST,"gpadmin","postgres",p769199,th-2123704032,"[local]",,2016-11-29 22:46:14 
> PST,1260,con72,cmd10,seg-1,,,x1260,sx1,"ERROR","XX000","Two or more external 
> tables use the same error table ""err_ext_pdr_cdci_pivotal_request_43448"" in 
> a statement (execMain.c:274)",,"select current_account_nbr,yearmonthint, 
> bank_name, first_date_open, max_cr_limit, care_credit_flag, cc1_flag, 
> partition_value, 'US' as loc from pdr_cdci_pivotal_request_43448 where 
> care_credit_flag<1
> union
> select current_account_nbr,yearmonthint, bank_name, first_date_open, 
> max_cr_limit, care_credit_flag, cc1_flag, partition_value, 'Non-US' as loc 
> from pdr_cdci_pivotal_request_43448 where 
> care_credit_flag=1;",0,,"execMain.c",274,"Stack trace:
> 10x8c5858 postgres errstart (??:0)
> 20x8c75db postgres elog_finish (??:0)
> 30x65f669 postgres  (??:0)
> 40x77d06a postgres walk_plan_node_fields (??:0)
> 50x77e3ee postgres plan_tree_walker (??:0)
> 60x77c70a postgres expression_tree_walker (??:0)
> 70x77e35d postgres plan_tree_walker (??:0)
> 80x77d06a postgres walk_plan_node_fields (??:0)
> 90x77dfe6 postgres plan_tree_walker (??:0)
> 10   0x77d06a postgres walk_plan_node_fields (??:0)
> 11   0x77e1e5 postgres plan_tree_walker (??:0)
> 12   0x77d06a postgres walk_plan_node_fields (??:0)
> 13   0x77dfe6 postgres plan_tree_walker (??:0)
> 14   0x77d06a postgres walk_plan_node_fields (??:0)
> 15   0x77e1e5 postgres plan_tree_walker (??:0)
> 16   0x66079b postgres ExecutorStart (??:0)
> 17   0x7ebf1d postgres PortalStart (??:0)
> 18   0x7e4288 postgres  (??:0)
> 19   0x7e54c2 postgres PostgresMain (??:0)
> 20   0x797d50 postgres  (??:0)
> 21   0x79ab19 postgres PostmasterMain (??:0)
> 22   0x4a4069 postgres main (??:0)
> 23   0x7fd97d486d5d libc.so.6 __libc_start_main (??:0)
> 24   0x4a40e9 postgres  (??:0)
> "
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HAWQ-1208) Porting gpdb interconnect fix to hawq

2016-12-09 Thread Ming LI (JIRA)
Ming LI created HAWQ-1208:
-

 Summary: Porting gpdb interconnect fix to hawq
 Key: HAWQ-1208
 URL: https://issues.apache.org/jira/browse/HAWQ-1208
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Interconnect
Reporter: Ming LI
Assignee: Lei Chang


Port the interconnect fixes from GPDB to HAWQ so that random failures in the 
interconnect suite can potentially be avoided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HAWQ-1208) Porting gpdb interconnect fix to hawq

2016-12-09 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1208:
-

Assignee: Ming LI  (was: Lei Chang)

> Porting gpdb interconnect fix to hawq
> -
>
> Key: HAWQ-1208
> URL: https://issues.apache.org/jira/browse/HAWQ-1208
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Interconnect
>Reporter: Ming LI
>Assignee: Ming LI
>
> Port the interconnect fixes from GPDB to HAWQ so that random failures in the 
> interconnect suite can potentially be avoided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HAWQ-1195) Synchrony:Union not working on external tables ERROR:"Two or more external tables use the same error table ""xxxxxxx"" in a statement (execMain.c:274)"

2016-12-06 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1195:
-

Assignee: Ming LI  (was: Lei Chang)

> Synchrony:Union not working on external tables ERROR:"Two or more external 
> tables use the same error table ""xxx"" in a statement (execMain.c:274)"
> ---
>
> Key: HAWQ-1195
> URL: https://issues.apache.org/jira/browse/HAWQ-1195
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: External Tables
>Reporter: Ming LI
>Assignee: Ming LI
>
> Hello,
> The user creates an external table and defines an error table, then runs a 
> UNION over the same external table with different WHERE conditions. The query 
> returns the error: ERROR:  Two or more external tables use the same error table 
> "err_ext_pdr_cdci_pivotal_request_43448" in a statement (execMain.c:274)
> Below is the master log from my reproduction (the whole log file is attached):
> {code}
> 2016-11-29 22:49:51.976864 
> PST,"gpadmin","postgres",p769199,th-2123704032,"[local]",,2016-11-29 22:46:14 
> PST,1260,con72,cmd10,seg-1,,,x1260,sx1,"ERROR","XX000","Two or more external 
> tables use the same error table ""err_ext_pdr_cdci_pivotal_request_43448"" in 
> a statement (execMain.c:274)",,"select current_account_nbr,yearmonthint, 
> bank_name, first_date_open, max_cr_limit, care_credit_flag, cc1_flag, 
> partition_value, 'US' as loc from pdr_cdci_pivotal_request_43448 where 
> care_credit_flag<1
> union
> select current_account_nbr,yearmonthint, bank_name, first_date_open, 
> max_cr_limit, care_credit_flag, cc1_flag, partition_value, 'Non-US' as loc 
> from pdr_cdci_pivotal_request_43448 where 
> care_credit_flag=1;",0,,"execMain.c",274,"Stack trace:
> 10x8c5858 postgres errstart (??:0)
> 20x8c75db postgres elog_finish (??:0)
> 30x65f669 postgres  (??:0)
> 40x77d06a postgres walk_plan_node_fields (??:0)
> 50x77e3ee postgres plan_tree_walker (??:0)
> 60x77c70a postgres expression_tree_walker (??:0)
> 70x77e35d postgres plan_tree_walker (??:0)
> 80x77d06a postgres walk_plan_node_fields (??:0)
> 90x77dfe6 postgres plan_tree_walker (??:0)
> 10   0x77d06a postgres walk_plan_node_fields (??:0)
> 11   0x77e1e5 postgres plan_tree_walker (??:0)
> 12   0x77d06a postgres walk_plan_node_fields (??:0)
> 13   0x77dfe6 postgres plan_tree_walker (??:0)
> 14   0x77d06a postgres walk_plan_node_fields (??:0)
> 15   0x77e1e5 postgres plan_tree_walker (??:0)
> 16   0x66079b postgres ExecutorStart (??:0)
> 17   0x7ebf1d postgres PortalStart (??:0)
> 18   0x7e4288 postgres  (??:0)
> 19   0x7e54c2 postgres PostgresMain (??:0)
> 20   0x797d50 postgres  (??:0)
> 21   0x79ab19 postgres PostmasterMain (??:0)
> 22   0x4a4069 postgres main (??:0)
> 23   0x7fd97d486d5d libc.so.6 __libc_start_main (??:0)
> 24   0x4a40e9 postgres  (??:0)
> "
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HAWQ-1195) Synchrony:Union not working on external tables ERROR:"Two or more external tables use the same error table ""xxxxxxx"" in a statement (execMain.c:274)"

2016-12-06 Thread Ming LI (JIRA)
Ming LI created HAWQ-1195:
-

 Summary: Synchrony:Union not working on external tables ERROR:"Two 
or more external tables use the same error table ""xxx"" in a statement 
(execMain.c:274)"
 Key: HAWQ-1195
 URL: https://issues.apache.org/jira/browse/HAWQ-1195
 Project: Apache HAWQ
  Issue Type: Bug
  Components: External Tables
Reporter: Ming LI
Assignee: Lei Chang


Hello,

The user creates an external table and defines an error table, then runs a UNION 
over the same external table with different WHERE conditions. The query returns 
the error: ERROR:  Two or more external tables use the same error table 
"err_ext_pdr_cdci_pivotal_request_43448" in a statement (execMain.c:274)

Below is the master log from my reproduction (the whole log file is attached):
{code}
2016-11-29 22:49:51.976864 
PST,"gpadmin","postgres",p769199,th-2123704032,"[local]",,2016-11-29 22:46:14 
PST,1260,con72,cmd10,seg-1,,,x1260,sx1,"ERROR","XX000","Two or more external 
tables use the same error table ""err_ext_pdr_cdci_pivotal_request_43448"" in a 
statement (execMain.c:274)",,"select current_account_nbr,yearmonthint, 
bank_name, first_date_open, max_cr_limit, care_credit_flag, cc1_flag, 
partition_value, 'US' as loc from pdr_cdci_pivotal_request_43448 where 
care_credit_flag<1
union
select current_account_nbr,yearmonthint, bank_name, first_date_open, 
max_cr_limit, care_credit_flag, cc1_flag, partition_value, 'Non-US' as loc from 
pdr_cdci_pivotal_request_43448 where 
care_credit_flag=1;",0,,"execMain.c",274,"Stack trace:
10x8c5858 postgres errstart (??:0)
20x8c75db postgres elog_finish (??:0)
30x65f669 postgres  (??:0)
40x77d06a postgres walk_plan_node_fields (??:0)
50x77e3ee postgres plan_tree_walker (??:0)
60x77c70a postgres expression_tree_walker (??:0)
70x77e35d postgres plan_tree_walker (??:0)
80x77d06a postgres walk_plan_node_fields (??:0)
90x77dfe6 postgres plan_tree_walker (??:0)
10   0x77d06a postgres walk_plan_node_fields (??:0)
11   0x77e1e5 postgres plan_tree_walker (??:0)
12   0x77d06a postgres walk_plan_node_fields (??:0)
13   0x77dfe6 postgres plan_tree_walker (??:0)
14   0x77d06a postgres walk_plan_node_fields (??:0)
15   0x77e1e5 postgres plan_tree_walker (??:0)
16   0x66079b postgres ExecutorStart (??:0)
17   0x7ebf1d postgres PortalStart (??:0)
18   0x7e4288 postgres  (??:0)
19   0x7e54c2 postgres PostgresMain (??:0)
20   0x797d50 postgres  (??:0)
21   0x79ab19 postgres PostmasterMain (??:0)
22   0x4a4069 postgres main (??:0)
23   0x7fd97d486d5d libc.so.6 __libc_start_main (??:0)
24   0x4a40e9 postgres  (??:0)
"
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (HAWQ-1149) Built-in function gp_persistent_build_all loses data in gp_relfile_node and gp_persistent_relfile_node

2016-11-21 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI closed HAWQ-1149.
-
   Resolution: Fixed
Fix Version/s: backlog

> Built-in function gp_persistent_build_all loses data in gp_relfile_node and 
> gp_persistent_relfile_node
> --
>
> Key: HAWQ-1149
> URL: https://issues.apache.org/jira/browse/HAWQ-1149
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Chunling Wang
>Assignee: Lei Chang
> Fix For: backlog
>
>
> When we create a new table and insert data into it, records appear in 
> gp_relfile_node, gp_persistent_relfile_node and gp_persistent_relation_node. 
> But if we run the HAWQ built-in function gp_persistent_build_all, we find 
> that the records in gp_relfile_node and gp_persistent_relfile_node for this 
> table are lost. And if there is more than one file in this table, we get an 
> error when we drop it. Here are the steps to reproduce this bug:
> 1. Create table a, and insert data into it with two concurrent processes:
> {code}
> postgres=# create table a(id int);
> CREATE TABLE
> postgres=# insert into a select generate_series(1, 1000);
> INSERT 0 1000
> {code}
> {code}
> postgres=# insert into a select generate_series(1000, 2000);
> INSERT 0 1001
> {code}
> 2. Check the persistent table and find two files in this table's directory:
> {code}
> postgres=# select oid from pg_class where relname='a';
>oid
> -
>  3017232
> (1 row)
> postgres=# select * from gp_relfile_node where relfilenode_oid=3017232;
>  relfilenode_oid | segment_file_num | persistent_tid | persistent_serial_num
> -+--++---
>  3017232 |1 | (4,128)|855050
>  3017232 |2 | (4,129)|855051
> (2 rows)
> postgres=# select * from gp_persistent_relation_node where 
> relfilenode_oid=3017232;
>  tablespace_oid | database_oid | relfilenode_oid | persistent_state | 
> reserved | parent_xid | persistent_serial_num | previous_free_tid
> +--+-+--+--++---+---
>   16385 |16387 | 3017232 |2 |
> 0 |  0 |158943 | (0,0)
> (1 row)
> postgres=# select * from gp_persistent_relfile_node where 
> relfilenode_oid=3017232;
>  tablespace_oid | database_oid | relfilenode_oid | segment_file_num | 
> relation_storage_manager | persistent_state | relation_bufpool_kind | 
> parent_xid | persistent_serial_num | previous_free_tid
> +--+-+--+--+--+---++---+---
>   16385 |16387 | 3017232 |1 | 
>2 |2 | 0 |  0 |
> 855050 | (0,0)
>   16385 |16387 | 3017232 |2 | 
>2 |2 | 0 |  0 |
> 855051 | (0,0)
> (2 rows)
> hadoop fs -ls /hawq_default/16385/16387/3017232
> -rw---   3 wangchunling supergroup  100103584 2016-11-08 17:02 
> /hawq_default/16385/16387/3017232/1
> -rw---   3 wangchunling supergroup  100103600 2016-11-08 17:02 
> /hawq_default/16385/16387/3017232/2
> {code}
> 3. Rebuild the persistent tables:
> {code}
> postgres=# insert into a select generate_series(1000, 2000);
> INSERT 0 1001
> postgres=# select gp_persistent_reset_all();
>  gp_persistent_reset_all
> -
>1
> (1 row)
> postgres=# select gp_persistent_build_all(false);
>  gp_persistent_build_all
> -
>1
> (1 row)
> {code}
> 4. Check persistent table and find data lost in gp_relfile_node and 
> gp_persistent_relfile_node.
> {code}
> postgres=# select * from gp_relfile_node where relfilenode_oid=3017232;
>  relfilenode_oid | segment_file_num | persistent_tid | persistent_serial_num
> -+--++---
> (0 rows)
> postgres=# select * from gp_persistent_relation_node where 
> relfilenode_oid=3017232;
>  tablespace_oid | database_oid | relfilenode_oid | persistent_state | 
> reserved | parent_xid | persistent_serial_num | previous_free_tid
> +--+-+--+--++---+---
>   16385 |16387 | 

[jira] [Commented] (HAWQ-1149) Built-in function gp_persistent_build_all loses data in gp_relfile_node and gp_persistent_relfile_node

2016-11-14 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15663116#comment-15663116
 ] 

Ming LI commented on HAWQ-1149:
---

Internally there are three bugs:
1) HAWQ 2.0 changed the HDFS path of a relation file from 
filespace/db/relfile.filenum to filespace/db/table/relfile/filenum, so the 
logic that scans relation files on HDFS must be changed correspondingly.
2) Dummy persistentTid and persistentSerialNum values were passed to 
PersistentRelation_MarkCreatePending().
3) Relation->rd_relationnodeinfo.isPresent needs to be reset so that the 
persistent TID and serial number can be re-fetched the next time 
PersistentBuild_BuildDb() runs.
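
As a rough illustration of point 1, here is a hypothetical, self-contained C 
helper (not HAWQ source) that builds the HAWQ 2.0 style relation-file path seen 
in this issue, where each relation owns a directory and every segment file 
number is a file inside it; a scanner that still expects one flat file per 
relation will miss these files.
{code}
#include <stdio.h>

/* Hypothetical helper: compose filespace/tablespace/database/relfilenode/segno. */
static void build_relfile_path(char *buf, size_t len, const char *filespace,
                               unsigned int tablespace, unsigned int database,
                               unsigned int relfilenode, int segfileno)
{
    snprintf(buf, len, "/%s/%u/%u/%u/%d",
             filespace, tablespace, database, relfilenode, segfileno);
}

int main(void)
{
    char path[256];

    /* OIDs taken from the reproduction steps quoted below. */
    build_relfile_path(path, sizeof(path), "hawq_default",
                       16385, 16387, 3017232, 1);
    printf("%s\n", path);   /* /hawq_default/16385/16387/3017232/1 */
    return 0;
}
{code}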

> Built-in function gp_persistent_build_all loses data in gp_relfile_node and 
> gp_persistent_relfile_node
> --
>
> Key: HAWQ-1149
> URL: https://issues.apache.org/jira/browse/HAWQ-1149
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> When we create a new table and insert data into it, records appear in 
> gp_relfile_node, gp_persistent_relfile_node and gp_persistent_relation_node. 
> But if we run the HAWQ built-in function gp_persistent_build_all, we find 
> that the records in gp_relfile_node and gp_persistent_relfile_node for this 
> table are lost. And if there is more than one file in this table, we get an 
> error when we drop it. Here are the steps to reproduce this bug:
> 1. Create table a, and insert data into it with two concurrent processes:
> {code}
> postgres=# create table a(id int);
> CREATE TABLE
> postgres=# insert into a select generate_series(1, 1000);
> INSERT 0 1000
> {code}
> {code}
> postgres=# insert into a select generate_series(1000, 2000);
> INSERT 0 1001
> {code}
> 2. Check the persistent table and find two files in this table's directory:
> {code}
> postgres=# select oid from pg_class where relname='a';
>oid
> -
>  3017232
> (1 row)
> postgres=# select * from gp_relfile_node where relfilenode_oid=3017232;
>  relfilenode_oid | segment_file_num | persistent_tid | persistent_serial_num
> -+--++---
>  3017232 |1 | (4,128)|855050
>  3017232 |2 | (4,129)|855051
> (2 rows)
> postgres=# select * from gp_persistent_relation_node where 
> relfilenode_oid=3017232;
>  tablespace_oid | database_oid | relfilenode_oid | persistent_state | 
> reserved | parent_xid | persistent_serial_num | previous_free_tid
> +--+-+--+--++---+---
>   16385 |16387 | 3017232 |2 |
> 0 |  0 |158943 | (0,0)
> (1 row)
> postgres=# select * from gp_persistent_relfile_node where 
> relfilenode_oid=3017232;
>  tablespace_oid | database_oid | relfilenode_oid | segment_file_num | 
> relation_storage_manager | persistent_state | relation_bufpool_kind | 
> parent_xid | persistent_serial_num | previous_free_tid
> +--+-+--+--+--+---++---+---
>   16385 |16387 | 3017232 |1 | 
>2 |2 | 0 |  0 |
> 855050 | (0,0)
>   16385 |16387 | 3017232 |2 | 
>2 |2 | 0 |  0 |
> 855051 | (0,0)
> (2 rows)
> hadoop fs -ls /hawq_default/16385/16387/3017232
> -rw---   3 wangchunling supergroup  100103584 2016-11-08 17:02 
> /hawq_default/16385/16387/3017232/1
> -rw---   3 wangchunling supergroup  100103600 2016-11-08 17:02 
> /hawq_default/16385/16387/3017232/2
> {code}
> 3. Rebuild the persistent tables:
> {code}
> postgres=# insert into a select generate_series(1000, 2000);
> INSERT 0 1001
> postgres=# select gp_persistent_reset_all();
>  gp_persistent_reset_all
> -
>1
> (1 row)
> postgres=# select gp_persistent_build_all(false);
>  gp_persistent_build_all
> -
>1
> (1 row)
> {code}
> 4. Check persistent table and find data lost in gp_relfile_node and 
> gp_persistent_relfile_node.
> {code}
> postgres=# select * from gp_relfile_node where relfilenode_oid=3017232;
>  relfilenode_oid | segment_file_num | persistent_tid | persistent_serial_num
> 

[jira] [Commented] (HAWQ-968) Incorrect free in url_fclose

2016-11-07 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15643458#comment-15643458
 ] 

Ming LI commented on HAWQ-968:
--

Hi Hong, 

Could you share more info with me? I went through the code in url_fclose(); 
although there are many code paths that call "free(file);", each of them frees 
the pointer only once.  Thanks.

In particular, FYI, the code snippet below only calls free(file) when reporting 
an error, after which it longjmps elsewhere, so free is called just once.
{code}
if (failOnError) free(file);
ereport( (failOnError ? ERROR : LOG)  
{code}
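
To make the one-time-free argument concrete, here is a minimal, self-contained 
illustration (plain C with setjmp/longjmp standing in for PostgreSQL's 
ereport(ERROR) machinery, not HAWQ code): when the error path jumps away right 
after free(file), control never reaches a second free of the same pointer.
{code}
#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf error_return;          /* stand-in for the error longjmp target */

static void fake_ereport_error(const char *msg)
{
    fprintf(stderr, "ERROR: %s\n", msg);
    longjmp(error_return, 1);         /* like ereport(ERROR), never returns */
}

static void close_file(char *file, int failOnError)
{
    if (failOnError)
    {
        free(file);                           /* freed exactly once ...      */
        fake_ereport_error("close failed");   /* ... then we jump away       */
    }
    free(file);                               /* non-error path frees here   */
}

int main(void)
{
    char *file = malloc(16);

    if (setjmp(error_return) == 0)
        close_file(file, 1);          /* take the error path */
    else
        printf("recovered after error; file was freed only once\n");
    return 0;
}
{code}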

> Incorrect free in url_fclose
> 
>
> Key: HAWQ-968
> URL: https://issues.apache.org/jira/browse/HAWQ-968
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: External Tables
>Affects Versions: backlog
>Reporter: hongwu
>Assignee: hongwu
>Priority: Minor
> Fix For: backlog
>
>
> There is a potential double-free risk in 
> url_fclose(https://github.com/apache/incubator-hawq/blob/master/src/backend/access/external/url.c#L1161)
>  of url.c.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HAWQ-968) Incorrect free in url_fclose

2016-11-06 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI updated HAWQ-968:
-
Summary: Incorrect free in url_fclose  (was: Incorrect free in url_close)

> Incorrect free in url_fclose
> 
>
> Key: HAWQ-968
> URL: https://issues.apache.org/jira/browse/HAWQ-968
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: External Tables
>Affects Versions: backlog
>Reporter: hongwu
>Assignee: hongwu
> Fix For: backlog
>
>
> There is a potential double-free risk in 
> url_fclose(https://github.com/apache/incubator-hawq/blob/master/src/backend/access/external/url.c#L1161)
>  of url.c.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HAWQ-1038) Missing BPCHAR in Data Type

2016-11-06 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1038.
---
Resolution: Won't Fix

> Missing BPCHAR in Data Type
> ---
>
> Key: HAWQ-1038
> URL: https://issues.apache.org/jira/browse/HAWQ-1038
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Goden Yao
>Assignee: David Yozie
> Fix For: backlog
>
>
> referring to 3rd party site:
> http://hdb.docs.pivotal.io/20/reference/catalog/pg_type.html 
> and 
> http://hdb.docs.pivotal.io/20/reference/HAWQDataTypes.html
> It's quite out of date if you check source code:
> https://github.com/apache/incubator-hawq/blob/master/src/interfaces/ecpg/ecpglib/pg_type.h
> {code}
> ...
> #define BPCHAROID 1042
> ...
> {code}
> We at least miss BPCHAR in the type table, maybe more.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HAWQ-1038) Missing BPCHAR in Data Type

2016-11-06 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15643015#comment-15643015
 ] 

Ming LI commented on HAWQ-1038:
---

Some points that may need your attention:
1) HAWQ is based on PostgreSQL 8.2.15 and most of the types are inherited from 
it, except arrays. You don't need to compare against PostgreSQL 9.5, because we 
don't merge PostgreSQL code changes.
2) All the types you listed in the Google doc are internal types; below are 
some of the types I know:
name: type for specific db object name, which is char (63)
int2vector: array of int2, used in system tables
tid: tuple ID
xid: transaction ID
cid: command ID
oidvector: array of oids
bpchar: blank-padded string, fixed storage length
You can see the descriptions in the source code below:
https://github.com/apache/incubator-hawq/blob/master/src/include/catalog/pg_type.h
So I don't think we need to change the doc at present, unless we start 
supporting more types for users.

> Missing BPCHAR in Data Type
> ---
>
> Key: HAWQ-1038
> URL: https://issues.apache.org/jira/browse/HAWQ-1038
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Goden Yao
>Assignee: David Yozie
> Fix For: backlog
>
>
> referring to 3rd party site:
> http://hdb.docs.pivotal.io/20/reference/catalog/pg_type.html 
> and 
> http://hdb.docs.pivotal.io/20/reference/HAWQDataTypes.html
> It's quite out of date if you check source code:
> https://github.com/apache/incubator-hawq/blob/master/src/interfaces/ecpg/ecpglib/pg_type.h
> {code}
> ...
> #define BPCHAROID 1042
> ...
> {code}
> We at least miss BPCHAR in the type table, maybe more.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HAWQ-1135) MADlib: Raising exception leads to database connection termination

2016-11-01 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627383#comment-15627383
 ] 

Ming LI commented on HAWQ-1135:
---

1) Execute SQL:
select madlib.lmf_igd_run(
  'LMF_output_table'::varchar,
  'madlibtestdata.mlens100k'::varchar,
  'user_id'::varchar,
  'movie_id'::varchar,
  'rating'::varchar,
  943::integer,
  1682::integer,
  10::integer,
  0.01::double precision,
  NULL::double precision,
  10::integer,
  1e-3::double precision
  );
  

2) gdb executor process:
b planner
c
c 17
b siglongjmp


(gdb) f 1
(gdb) p *edata
$2 = {elevel = 20, output_to_server = 1 '\001', output_to_client = 1 '\001', 
show_funcname = 0 '\000',
  omit_location = 1 '\001', fatal_return = 0 '\000', hide_stmt = 0 '\000', 
send_alert = 0 '\000',
  filename = 0x7f68bf30fd08 
"/data/home/gpdbchina/madlib-1.9.1-build/incubator-madlib/src/ports/hawq/../greenplum/dbconnector/../../postgres/dbconnector/UDF_impl.hpp",
 lineno = 210,
  funcname = 0x7f68bf3104d4 "call", domain = 0xc6b0bb "postgres-8.2", 
sqlerrcode = 50856066,
  message = 0x1c42950 "Function \"madlib.lmf_igd_transition(double 
precision[],integer,integer,double precision,double 
precision[],integer,integer,integer,double precision,double precision)\": 
Invalid type conversion. Null wh"..., detail = 0x0, detail_log = 0x0, hint = 
0x0,
  context = 0x1c42eb0 "SQL statement \"\n", ' ' , "SELECT\n", 
' ' , "1 AS _iteration,\n", ' ' , "(\n", ' 
' , "SELECT\n", ' ' , 
"madlib.lmf_igd_step(\n", ' ' , "(_src.user_id)"..., 
cursorpos = 0,
  internalpos = 0, internalquery = 0x0, saved_errno = 11, stacktracearray = 
{0x95d52a, 0x7f68bf2382e9,
0x6fb00b, 0x6faeab, 0x6fb394, 0x6fc17d, 0x6fce08, 0x6fc08a, 0x6df92f, 
0x7183ff, 0xaab907, 0x6d608a,
0x72f44d, 0x72ef58, 0x72c1d9, 0x7f68c44a8011, 0x7f68c44a7a54, 
0x7f68c41ca9d4, 0x7f68c41cc647,
0x7f68c41caa94, 0x7f68c41cc647, 0x7f68c415fd9d, 0x7f68c4138c63, 
0x7f68c41c9460, 0x7f68c41cbb7f,
0x7f68c41cc647, 0x7f68c41cc722, 0x7f68c44a1e97, 0x7f68c44a155c, 
0x7f68c449faf3},
  stacktracesize = 30, printstack = 0 '\000'}
(gdb) bt


#0  0x003dc100e150 in siglongjmp () from /lib64/libpthread.so.0
#1  0x0095d6f7 in errfinish (dummy=0) at elog.c:578
#2  0x7f68bf238321 in long 
madlib::dbconnector::postgres::UDF::call(FunctionCallInfoData*)
 ()
   from /usr/local/madlib/Versions/1.9.1/ports/hawq/2.0/lib/libmadlib.so
#3  0x006fb00b in invoke_agg_trans_func (transfn=0x2f7cff8, numargs=9, 
transValue=37836256,
noTransvalue=0x2f7a431 "", transValueIsNull=0x2f7a430 "", transtypeByVal=0 
'\000', transtypeLen=-1,
fcinfo=0x7fff61be3500, funcctx=0x2fb5b28, tuplecontext=0x244d0b0, 
mem_manager=0x2fb5e00)
at nodeAgg.c:471
#4  0x006faeab in advance_transition_function (aggstate=0x2fb5b28, 
peraggstate=0x2f7cfc0,
pergroupstate=0x2f7a428, fcinfo=0x7fff61be3500, mem_manager=0x2fb5e00) at 
nodeAgg.c:392
#5  0x006fb394 in advance_aggregates (aggstate=0x2fb5b28, 
pergroup=0x2f7a428,
mem_manager=0x2fb5e00) at nodeAgg.c:618
#6  0x006fc17d in agg_retrieve_scalar (aggstate=0x2fb5b28) at 
nodeAgg.c:1173
#7  0x006fce08 in agg_retrieve_direct (aggstate=0x2fb5b28) at 
nodeAgg.c:1693
#8  0x006fc08a in ExecAgg (node=0x2fb5b28) at nodeAgg.c:1138
#9  0x006df92f in ExecProcNode (node=0x2fb5b28) at execProcnode.c:979
#10 0x007183ff in ExecSetParamPlan (node=0x2fb5808, econtext=0x2f74a10, 
gbl_queryDesc=0x2fa8550)
at nodeSubplan.c:1161
#11 0x00aab907 in preprocess_initplans (queryDesc=0x2fa8550) at 
cdbsubplan.c:171
#12 0x006d608a in ExecutorStart (queryDesc=0x2fa8550, eflags=0) at 
execMain.c:929
#13 0x0072f44d in _SPI_pquery (queryDesc=0x2fa8550, fire_triggers=1 
'\001', tcount=0)
at spi.c:2214
#14 0x0072ef58 in _SPI_execute_plan (plan=0x242be00, Values=0x244c108,
Nulls=0x21e5a78 "notice", snapshot=0x0, crosscheck_snapshot=0x0, 
read_only=0 '\000',
fire_triggers=1 '\001', tcount=0) at spi.c:1972
#15 0x0072c1d9 in SPI_execute_plan (plan=0x242be00, Values=0x244c108, 
Nulls=0x21e5a78 "notice",
read_only=0 '\000', tcount=0) at spi.c:520
#16 0x7f68c44a8011 in PLy_spi_execute_plan (ob=0x228c8b8, list=0x222e3b0, 
limit=0)
at plpython.c:3737
#17 0x7f68c44a7a54 in PLy_spi_execute (self=0x0, args=0x2221098) at 
plpython.c:3635
#18 0x7f68c41ca9d4 in PyEval_EvalFrameEx () from 
/usr/lib64/libpython2.6.so.1.0
#19 0x7f68c41cc647 in PyEval_EvalCodeEx () from 
/usr/lib64/libpython2.6.so.1.0
#20 0x7f68c41caa94 in PyEval_EvalFrameEx () from 
/usr/lib64/libpython2.6.so.1.0
#21 0x7f68c41cc647 in PyEval_EvalCodeEx () from 
/usr/lib64/libpython2.6.so.1.0
#22 0x7f68c415fd9d in ?? () from /usr/lib64/libpython2.6.so.1.0
#23 0x7f68c4138c63 in PyObject_Call () from 

[jira] [Created] (HAWQ-1135) MADlib: Raising exception leads to database connection termination

2016-11-01 Thread Ming LI (JIRA)
Ming LI created HAWQ-1135:
-

 Summary: MADlib: Raising exception leads to database connection 
termination
 Key: HAWQ-1135
 URL: https://issues.apache.org/jira/browse/HAWQ-1135
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Core
Reporter: Ming LI
Assignee: Lei Chang


MADlib tests on HAWQ 2.0 nightly builds fail because the server terminates its 
connection. The failing tests check handling of bad input by expecting an 
exception for specific user inputs. These exceptions are raised cleanly on other 
platforms, including HAWQ 2.0 and all Greenplum DBs.
Reproduction Steps
Install MADlib using the RPM and HAWQ install script.
Run attached script (called hawq_2.0.1_test.sql)
Current error message is
{{
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
}}
Expected error is
{{
ERROR: spiexceptions.InvalidParameterValue: Function 
"madlib.lmf_igd_transition(double precision[],integer,integer,double 
precision,double precision[],integer,integer,integer,double precision,double 
precision)": Invalid type conversion. Null where not expected.
}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HAWQ-1098) build error when "configure --prefix" with different directory without running "make distclean" previously

2016-10-13 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1098.
---
Resolution: Fixed

> build error when "configure --prefix" with different directory without 
> running "make distclean" previously
> --
>
> Key: HAWQ-1098
> URL: https://issues.apache.org/jira/browse/HAWQ-1098
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Build
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> A customer reported this at:
> http://stackoverflow.com/questions/39217467/hawq-installation-on-redhat
> If "configure --prefix" is run with a different directory without first 
> running "make distclean", the build reports an error:
> {code}
> ld: warning: directory not found for option 
> '-L/Users/gpadmin/workspace/hawq2/apache-hawq/depends/libhdfs3/build/install/Users/gpadmin/workspace/hawq2/hawq-db-devel3/lib'
> ld: warning: directory not found for option 
> '-L/Users/gpadmin/workspace/hawq2/apache-hawq/depends/libyarn/build/install/Users/gpadmin/workspace/hawq2/hawq-db-devel3/lib'
> ld: library not found for -lhdfs3
> clang: error: linker command failed with exit code 1 (use -v to see 
> invocation)
> make[2]: *** [postgres] Error 1
> make[1]: *** [all] Error 2
> make: *** [all] Error 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HAWQ-1098) build error when "configure --prefix" with different directory without running "make distclean" previously

2016-10-12 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1098:
-

Assignee: Ming LI  (was: Lei Chang)

> build error when "configure --prefix" with different directory without 
> running "make distclean" previously
> --
>
> Key: HAWQ-1098
> URL: https://issues.apache.org/jira/browse/HAWQ-1098
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Build
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: backlog
>
>
> A customer reported this at:
> http://stackoverflow.com/questions/39217467/hawq-installation-on-redhat
> If "configure --prefix" is run with a different directory without first 
> running "make distclean", the build reports an error:
> {code}
> ld: warning: directory not found for option 
> '-L/Users/gpadmin/workspace/hawq2/apache-hawq/depends/libhdfs3/build/install/Users/gpadmin/workspace/hawq2/hawq-db-devel3/lib'
> ld: warning: directory not found for option 
> '-L/Users/gpadmin/workspace/hawq2/apache-hawq/depends/libyarn/build/install/Users/gpadmin/workspace/hawq2/hawq-db-devel3/lib'
> ld: library not found for -lhdfs3
> clang: error: linker command failed with exit code 1 (use -v to see 
> invocation)
> make[2]: *** [postgres] Error 1
> make[1]: *** [all] Error 2
> make: *** [all] Error 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HAWQ-1098) build error when "configure --prefix" with different directory without running "make distclean" previously

2016-10-12 Thread Ming LI (JIRA)
Ming LI created HAWQ-1098:
-

 Summary: build error when "configure --prefix" with different 
directory without running "make distclean" previously
 Key: HAWQ-1098
 URL: https://issues.apache.org/jira/browse/HAWQ-1098
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Build
Reporter: Ming LI
Assignee: Lei Chang
 Fix For: backlog


A customer reported this at:
http://stackoverflow.com/questions/39217467/hawq-installation-on-redhat

If "configure --prefix" is run with a different directory without first running 
"make distclean", the build reports an error:
{code}
ld: warning: directory not found for option 
'-L/Users/gpadmin/workspace/hawq2/apache-hawq/depends/libhdfs3/build/install/Users/gpadmin/workspace/hawq2/hawq-db-devel3/lib'
ld: warning: directory not found for option 
'-L/Users/gpadmin/workspace/hawq2/apache-hawq/depends/libyarn/build/install/Users/gpadmin/workspace/hawq2/hawq-db-devel3/lib'
ld: library not found for -lhdfs3
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [postgres] Error 1
make[1]: *** [all] Error 2
make: *** [all] Error 2
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HAWQ-1094) Select on INTERNAL table returns wrong results when hdfs blocks have checksum errors

2016-10-10 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1094:
-

Assignee: Ming LI  (was: Lei Chang)

> Select on INTERNAL table returns wrong results when hdfs blocks have checksum 
> errors
> 
>
> Key: HAWQ-1094
> URL: https://issues.apache.org/jira/browse/HAWQ-1094
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Fault Tolerance
>Reporter: Ming LI
>Assignee: Ming LI
>
> I created a parquet table and inserted the following values into the table:
> {code}
> sr37228_repro=# select * from number;
>  id
> 
>   1
>   1
>   1
>   1
>   1
> (5 rows)
> {code}
> I then modified the data in two of the three blocks and tried reading the 
> data again.
> {code}
> Modifying contents of internal table blocks...
> Found hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 in hdfs
> Modifying block 
> /hadoop/hdfs/data/current/BP-2023073008-172.28.21.63-1462922052672/current/finalized/subdir0/subdir0/blk_1073742008
>  on 172.28.21.155
> block_script.sh   
>  100%  228 0.2KB/s   00:00
> Modifying block 
> /hadoop/hdfs/data/current/BP-2023073008-172.28.21.63-1462922052672/current/finalized/subdir0/subdir0/blk_1073742008
>  on 172.28.21.156
> block_script.sh   
>  100%  228 0.2KB/s   00:00
> Running count query again, this time with bad data in two of the three blocks
>  count |id
> ---+--
>  1 |0
>  2 |1
>  1 | 16777216
>  1 | 16777217
> (4 rows)
> Checking Showing file health:
> Checking hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 health
> Connecting to namenode via 
> http://hdm1.hdp.local:50070/fsck?ugi=gpadmin&files=1&blocks=1&locations=1&path=%2Fhawq_default%2F16385%2F16543%2F17000%2F10
> FSCK started by gpadmin (auth:SIMPLE) from /172.28.21.157 for path 
> /hawq_default/16385/16543/17000/10 at Mon Sep 26 12:07:53 PDT 2016
> /hawq_default/16385/16543/17000/10 206 bytes, 1 block(s):  OK
> 0. BP-2023073008-172.28.21.63-1462922052672:blk_1073742008_1186 len=206 
> repl=3 
> [DatanodeInfoWithStorage[172.28.21.155:50010,DS-1a18c785-48e5-4ab8-9228-b3f6857b952a,DISK],
>  
> DatanodeInfoWithStorage[172.28.19.211:50010,DS-6bf49ae7-6745-448b-803d-d12d93acad1d,DISK],
>  
> DatanodeInfoWithStorage[172.28.21.156:50010,DS-d22b0f7f-7065-42c4-bb66-ea361ec5e56a,DISK]]
> Status: HEALTHY
>  Total size:206 B
>  Total dirs:0
>  Total files:   1
>  Total symlinks:0
>  Total blocks (validated):  1 (avg. block size 206 B)
>  Minimally replicated blocks:   1 (100.0 %)
>  Over-replicated blocks:0 (0.0 %)
>  Under-replicated blocks:   0 (0.0 %)
>  Mis-replicated blocks: 0 (0.0 %)
>  Default replication factor:3
>  Average block replication: 3.0
>  Corrupt blocks:0
>  Missing replicas:  0 (0.0 %)
>  Number of data-nodes:  3
>  Number of racks:   1
> FSCK ended at Mon Sep 26 12:07:53 PDT 2016 in 0 milliseconds
> {code}
> When setupBlockReader reads a bad block using the LocalBlockReader, the 
> reader correctly detects a bad checksum.
> {code}
> 2016-09-26 13:02:09.267021 
> PDT,,,p380682,th7956092160,,,seg-1,"LOG","0","Resource 
> manager discovered local host IPv4 address 
> 127.0.0.1",,,0,,"network_utils.c",210,
> 2016-09-26 13:02:09.267171 
> PDT,,,p380682,th7956092160,,,seg-1,"LOG","0","Resource 
> manager discovered local host IPv4 address 
> 172.28.21.155",,,0,,"network_utils.c",210,
> 2016-09-26 13:02:16.239048 
> PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
>  12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG1","0","Dropping 
> in memory mapping OidInMemHeapMapping",,"SET log_min_messages TO 
> 'debug5'",0,,"cdbinmemheapam.c",293,
> 2016-09-26 13:02:16.239289 
> PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
>  12:32:31 
> PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","0","CommitTransactionCommand",,"SET
>  log_min_messages TO 'debug5'",0,,"postgres.c",3131,
> 2016-09-26 13:02:16.239435 
> PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
>  12:32:31 
> PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","0","CommitTransaction",,"SET
>  log_min_messages TO 'debug5'",0,,"xact.c",5103,
> 2016-09-26 13:02:16.239819 
> PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
>  12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","0","name: 
> unnamed; blockState:   

[jira] [Created] (HAWQ-1094) Select on INTERNAL table returns wrong results when hdfs blocks have checksum errors

2016-10-10 Thread Ming LI (JIRA)
Ming LI created HAWQ-1094:
-

 Summary: Select on INTERNAL table returns wrong results when hdfs 
blocks have checksum errors
 Key: HAWQ-1094
 URL: https://issues.apache.org/jira/browse/HAWQ-1094
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Fault Tolerance
Reporter: Ming LI
Assignee: Lei Chang


I created a parquet table and inserted the following values into the table:

{code}
sr37228_repro=# select * from number;
 id

  1
  1
  1
  1
  1
(5 rows)
{code}

I then modified the data in two of the three blocks and tried reading the data 
again.

{code}
Modifying contents of internal table blocks...

Found hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 in hdfs

Modifying block 
/hadoop/hdfs/data/current/BP-2023073008-172.28.21.63-1462922052672/current/finalized/subdir0/subdir0/blk_1073742008
 on 172.28.21.155
block_script.sh 
   100%  228 0.2KB/s   00:00
Modifying block 
/hadoop/hdfs/data/current/BP-2023073008-172.28.21.63-1462922052672/current/finalized/subdir0/subdir0/blk_1073742008
 on 172.28.21.156
block_script.sh 
   100%  228 0.2KB/s   00:00

Running count query again, this time with bad data in two of the three blocks
 count |id
---+--
 1 |0
 2 |1
 1 | 16777216
 1 | 16777217
(4 rows)


Checking Showing file health:

Checking hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 health
Connecting to namenode via 
http://hdm1.hdp.local:50070/fsck?ugi=gpadmin&files=1&blocks=1&locations=1&path=%2Fhawq_default%2F16385%2F16543%2F17000%2F10
FSCK started by gpadmin (auth:SIMPLE) from /172.28.21.157 for path 
/hawq_default/16385/16543/17000/10 at Mon Sep 26 12:07:53 PDT 2016
/hawq_default/16385/16543/17000/10 206 bytes, 1 block(s):  OK
0. BP-2023073008-172.28.21.63-1462922052672:blk_1073742008_1186 len=206 repl=3 
[DatanodeInfoWithStorage[172.28.21.155:50010,DS-1a18c785-48e5-4ab8-9228-b3f6857b952a,DISK],
 
DatanodeInfoWithStorage[172.28.19.211:50010,DS-6bf49ae7-6745-448b-803d-d12d93acad1d,DISK],
 
DatanodeInfoWithStorage[172.28.21.156:50010,DS-d22b0f7f-7065-42c4-bb66-ea361ec5e56a,DISK]]

Status: HEALTHY
 Total size:206 B
 Total dirs:0
 Total files:   1
 Total symlinks:0
 Total blocks (validated):  1 (avg. block size 206 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:0 (0.0 %)
 Under-replicated blocks:   0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor:3
 Average block replication: 3.0
 Corrupt blocks:0
 Missing replicas:  0 (0.0 %)
 Number of data-nodes:  3
 Number of racks:   1
FSCK ended at Mon Sep 26 12:07:53 PDT 2016 in 0 milliseconds
{code}

When setupBlockReader reads a bad block using the LocalBlockReader, the reader 
correctly detects a bad checksum.

{code}
2016-09-26 13:02:09.267021 
PDT,,,p380682,th7956092160,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 127.0.0.1",,,0,,"network_utils.c",210,
2016-09-26 13:02:09.267171 
PDT,,,p380682,th7956092160,,,seg-1,"LOG","0","Resource manager 
discovered local host IPv4 address 
172.28.21.155",,,0,,"network_utils.c",210,
2016-09-26 13:02:16.239048 
PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG1","0","Dropping in 
memory mapping OidInMemHeapMapping",,"SET log_min_messages TO 
'debug5'",0,,"cdbinmemheapam.c",293,
2016-09-26 13:02:16.239289 
PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
 12:32:31 
PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","0","CommitTransactionCommand",,"SET
 log_min_messages TO 'debug5'",0,,"postgres.c",3131,
2016-09-26 13:02:16.239435 
PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
 12:32:31 
PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","0","CommitTransaction",,"SET
 log_min_messages TO 'debug5'",0,,"xact.c",5103,
2016-09-26 13:02:16.239819 
PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","0","name: 
unnamed; blockState:   STARTED; state: INPROGR, xid/subid/cid: 6227/1/0, 
nestlvl: 1, children: <>",,"SET log_min_messages TO 
'debug5'",0,,"xact.c",5128,
2016-09-26 13:02:16.239978 
PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26
 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG1","0","Dropping in 
memory mapping OidInMemOnlyMapping",,"SET log_min_messages TO 
'debug5'",0,,"cdbinmemheapam.c",293,

[jira] [Resolved] (HAWQ-1093) Bump Orca version and enable Orca related exception propagation

2016-10-09 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1093.
---
   Resolution: Fixed
Fix Version/s: backlog

> Bump Orca version and enable Orca related exception propagation
> ---
>
> Key: HAWQ-1093
> URL: https://issues.apache.org/jira/browse/HAWQ-1093
> Project: Apache HAWQ
>  Issue Type: Improvement
>  Components: Optimizer
>Reporter: Haisheng Yuan
>Assignee: Haisheng Yuan
> Fix For: backlog
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HAWQ-771) Table and function can not be found by non-superuser in specified schema

2016-09-26 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-771.
--
Resolution: Invalid

By default, users cannot access any objects in schemas they do not own. To 
allow that, the owner of the schema must grant the USAGE privilege on the 
schema. 

So please run the SQL below before selecting that table as testrole:
grant usage on schema testschema to testrole;

> Table and function can not be found by non-superuser in specified schema
> 
>
> Key: HAWQ-771
> URL: https://issues.apache.org/jira/browse/HAWQ-771
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: 2.0.0.0-incubating
>Reporter: Ruilong Huo
>Assignee: Ruilong Huo
> Fix For: backlog
>
> Attachments: function.out.bug, function.out.expected, function.sql, 
> table.out.bug, table.out.expected, table.sql
>
>
> With a non-superuser, tables and functions cannot be found in a specified 
> schema, while:
> 1) they can be found in the default schema, i.e., "$user", public
> 2) they can be found by a superuser.
> This issue occurs in hawq 2.0 and postgres 9.x. See attached sql file for 
> reproduction steps and out file for expected/actual error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HAWQ-1076) permission denied for using sequence with SELECT/USUAGE privilege

2016-09-26 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15522283#comment-15522283
 ] 

Ming LI edited comment on HAWQ-1076 at 9/26/16 7:04 AM:


Two problems are fixed here:
1) The column default expression is only needed on INSERT, so a plain SELECT 
should not require privileges on the sequence.
2) setval() needs the UPDATE privilege, while nextval() needs the USAGE or 
UPDATE privilege.

{code}
[ro_user] postgres=> select * from t1;
ERROR: permission denied for relation t1
[gpadmin] postgres=# grant SELECT on table t1 to role1;
GRANT
[ro_user] postgres=> select * from t1;
c1 | c2
---+---
1 | 1
1 | 2
(2 rows)
[ro_user] postgres=> insert into t1 (c1) values(11);
ERROR: permission denied for relation t1
[gpadmin] postgres=# grant INSERT on table t1 to role1;
GRANT
[ro_user] postgres=> insert into t1 (c1) values(11);
ERROR: permission denied for sequence seq1
[gpadmin] postgres=# grant USAGE on sequence seq1 to role1;
GRANT
[ro_user] postgres=> insert into t1 (c1) values(11);
INSERT 0 1
[ro_user] postgres=> select setval('seq1', 1, true) ;
ERROR: permission denied for sequence seq1
[gpadmin] postgres=# grant UPDATE on sequence seq1 to role1;
GRANT
[ro_user] postgres=> select setval('seq1', 1, true) ;
setval

1
(1 row)
{code}


was (Author: mli):
Two problems fixed here:
1) Default statement only need when INSERT.
2) setval() need UPDATE privilege, while nextval() need USAGE or UPDATE 
privilege.

```
[ro_user] postgres=> select * from t1;
ERROR: permission denied for relation t1
[gpadmin] postgres=# grant SELECT on table t1 to role1;
GRANT
[ro_user] postgres=> select * from t1;
c1 | c2
---+---
1 | 1
1 | 2
(2 rows)
[ro_user] postgres=> insert into t1 (c1) values(11);
ERROR: permission denied for relation t1
[gpadmin] postgres=# grant INSERT on table t1 to role1;
GRANT
[ro_user] postgres=> insert into t1 (c1) values(11);
ERROR: permission denied for sequence seq1
[gpadmin] postgres=# grant USAGE on sequence seq1 to role1;
GRANT
[ro_user] postgres=> insert into t1 (c1) values(11);
INSERT 0 1
[ro_user] postgres=> select setval('seq1', 1, true) ;
ERROR: permission denied for sequence seq1
[gpadmin] postgres=# grant UPDATE on sequence seq1 to role1;
GRANT
[ro_user] postgres=> select setval('seq1', 1, true) ;
setval

1
(1 row)
```

> permission denied for using sequence with SELECT/USUAGE privilege
> -
>
> Key: HAWQ-1076
> URL: https://issues.apache.org/jira/browse/HAWQ-1076
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Ming LI
>Assignee: Lei Chang
> Fix For: backlog
>
>
> A customer had a table with a column taking its default value from a sequence, 
> and they want a role to have read-only access to the table as well as the 
> sequence. However, they have to grant the ALL privilege on the sequence to the 
> user in order to run a SELECT query; otherwise it fails with "ERROR:  permission 
> denied for sequence xxx".
> Following are the steps to reproduce the issue in house.
> 1. Create a table with column taking default value from a sequence. And grant 
> SELECT/USAGE privilege on the sequence to a user
> {code:java}
> [gpadmin@hdm1 ~]$ psql
> psql (8.2.15)
> Type "help" for help.
> gpadmin=# \d ns1.t1
>Append-Only Table "ns1.t1"
>  Column |  Type   |  Modifiers  
> +-+-
>  c1 | text| 
>  c2 | integer | not null default nextval('ns1.t1_c2_seq'::regclass)
> Compression Type: None
> Compression Level: 0
> Block Size: 32768
> Checksum: f
> Distributed randomly
> gpadmin=# grant SELECT,usage on sequence ns1.t1_c2_seq to ro_user;
> GRANT
> gpadmin=# select * from pg_class where relname='t1_c2_seq';
>   relname  | relnamespace | reltype | relowner | relam | relfilenode | 
> reltablespace | relpages | reltuples | reltoast
> relid | reltoastidxid | relaosegrelid | relaosegidxid | relhasindex | 
> relisshared | relkind | relstorage | relnatts | 
> relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | 
> relhaspkey | relhasrules | relhassubclass | rel
> frozenxid |  relacl  | reloptions 
> ---+--+-+--+---+-+---+--+---+-
> --+---+---+---+-+-+-++--+-
> --+-+--+--+-+++-++
> --+--+
>  t1_c2_seq |17638 |   17650 |   10 | 0 |   17649 |
>  0 |1 | 1 | 
> 0 | 0 | 0 | 0 | f   | f   

[jira] [Commented] (HAWQ-1076) permission denied for using sequence with SELECT/USUAGE privilege

2016-09-26 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15522283#comment-15522283
 ] 

Ming LI commented on HAWQ-1076:
---

Two problems are fixed here:
1) The column default expression is only needed on INSERT, so a plain SELECT 
should not require privileges on the sequence.
2) setval() needs the UPDATE privilege, while nextval() needs the USAGE or 
UPDATE privilege.

{code}
[ro_user] postgres=> select * from t1;
ERROR: permission denied for relation t1
[gpadmin] postgres=# grant SELECT on table t1 to role1;
GRANT
[ro_user] postgres=> select * from t1;
c1 | c2
---+---
1 | 1
1 | 2
(2 rows)
[ro_user] postgres=> insert into t1 (c1) values(11);
ERROR: permission denied for relation t1
[gpadmin] postgres=# grant INSERT on table t1 to role1;
GRANT
[ro_user] postgres=> insert into t1 (c1) values(11);
ERROR: permission denied for sequence seq1
[gpadmin] postgres=# grant USAGE on sequence seq1 to role1;
GRANT
[ro_user] postgres=> insert into t1 (c1) values(11);
INSERT 0 1
[ro_user] postgres=> select setval('seq1', 1, true) ;
ERROR: permission denied for sequence seq1
[gpadmin] postgres=# grant UPDATE on sequence seq1 to role1;
GRANT
[ro_user] postgres=> select setval('seq1', 1, true) ;
setval

1
(1 row)
{code}
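
To illustrate the distinction, here is a minimal C sketch against PostgreSQL's 
ACL API; the helper name check_sequence_priv and its placement are assumptions 
for illustration only, not the actual HAWQ patch:

{code}
/* Sketch only: nextval() is satisfied by USAGE or UPDATE, setval() requires
 * UPDATE.  pg_class_aclcheck() with a combined mask succeeds when any of the
 * requested privileges is held. */
#include "postgres.h"
#include "miscadmin.h"
#include "utils/acl.h"
#include "utils/lsyscache.h"

static void
check_sequence_priv(Oid seqoid, bool is_setval)
{
    AclMode   required = is_setval ? ACL_UPDATE : (ACL_USAGE | ACL_UPDATE);
    AclResult aclresult = pg_class_aclcheck(seqoid, GetUserId(), required);

    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult, ACL_KIND_CLASS, get_rel_name(seqoid));
}
{code}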

> permission denied for using sequence with SELECT/USUAGE privilege
> -
>
> Key: HAWQ-1076
> URL: https://issues.apache.org/jira/browse/HAWQ-1076
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Ming LI
>Assignee: Lei Chang
> Fix For: backlog
>
>
> A customer had a table with a column taking its default value from a sequence, 
> and they want a role to have read-only access to the table as well as the 
> sequence. However, they have to grant the ALL privilege on the sequence to the 
> user in order to run a SELECT query; otherwise it fails with "ERROR:  permission 
> denied for sequence xxx".
> Following are the steps to reproduce the issue in house.
> 1. Create a table with column taking default value from a sequence. And grant 
> SELECT/USAGE privilege on the sequence to a user
> {code:java}
> [gpadmin@hdm1 ~]$ psql
> psql (8.2.15)
> Type "help" for help.
> gpadmin=# \d ns1.t1
>Append-Only Table "ns1.t1"
>  Column |  Type   |  Modifiers  
> +-+-
>  c1 | text| 
>  c2 | integer | not null default nextval('ns1.t1_c2_seq'::regclass)
> Compression Type: None
> Compression Level: 0
> Block Size: 32768
> Checksum: f
> Distributed randomly
> gpadmin=# grant SELECT,usage on sequence ns1.t1_c2_seq to ro_user;
> GRANT
> gpadmin=# select * from pg_class where relname='t1_c2_seq';
>   relname  | relnamespace | reltype | relowner | relam | relfilenode | 
> reltablespace | relpages | reltuples | reltoast
> relid | reltoastidxid | relaosegrelid | relaosegidxid | relhasindex | 
> relisshared | relkind | relstorage | relnatts | 
> relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | 
> relhaspkey | relhasrules | relhassubclass | rel
> frozenxid |  relacl  | reloptions 
> ---+--+-+--+---+-+---+--+---+-
> --+---+---+---+-+-+-++--+-
> --+-+--+--+-+++-++
> --+--+
>  t1_c2_seq |17638 |   17650 |   10 | 0 |   17649 |
>  0 |1 | 1 | 
> 0 | 0 | 0 | 0 | f   | f   
> | S   | h  |9 | 
> 0 |   0 |0 |0 |   0 | f  | f  
> | f   | f  |
> 0 | {gpadmin=rwU/gpadmin,ro_user=rU/gpadmin} | 
> (1 row)
> gpadmin=# insert into ns1.t1(c1) values('abc');
> INSERT 0 1
> gpadmin=# select * from ns1.t1;
>  c1  | c2 
> -+
>  abc |  3
> (1 row)
> {code}
> 2. Connect to database as user with readonly access and run SELECT query 
> against the table. It will fail with "permission denied" error
> {code:java}
> [gpadmin@hdm1 ~]$ psql -U ro_user -d gpadmin
> psql (8.2.15)
> Type "help" for help.
> gpadmin=> select * from ns1.t1;
> ERROR:  permission denied for sequence t1_c2_seq
> {code}
> 3. Grant the ALL privilege on the sequence to that user, which makes it able 
> to SELECT data from the table
> {code:java}
> [gpadmin@hdm1 ~]$ psql
> gpadmin-# psql (8.2.15)
> gpadmin-# Type "help" for help.
> gpadmin-# 
> gpadmin=# grant update on sequence ns1.t1_c2_seq to ro_user;
> GRANT
> gpadmin=# select * from pg_class where relname='t1_c2_seq';
> 

[jira] [Created] (HAWQ-1076) permission denied for using sequence with SELECT/USUAGE privilege

2016-09-26 Thread Ming LI (JIRA)
Ming LI created HAWQ-1076:
-

 Summary: permission denied for using sequence with SELECT/USUAGE 
privilege
 Key: HAWQ-1076
 URL: https://issues.apache.org/jira/browse/HAWQ-1076
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Catalog
Reporter: Ming LI
Assignee: Lei Chang
 Fix For: backlog


A customer had a table with a column taking its default value from a sequence, 
and they want a role to have read-only access to the table as well as the 
sequence. However, they have to grant the ALL privilege on the sequence to the 
user in order to run a SELECT query; otherwise it fails with "ERROR:  permission 
denied for sequence xxx".

Following are the steps to reproduce the issue in house.

1. Create a table with column taking default value from a sequence. And grant 
SELECT/USAGE privilege on the sequence to a user
{code:java}
[gpadmin@hdm1 ~]$ psql
psql (8.2.15)
Type "help" for help.

gpadmin=# \d ns1.t1
   Append-Only Table "ns1.t1"
 Column |  Type   |  Modifiers  
+-+-
 c1 | text| 
 c2 | integer | not null default nextval('ns1.t1_c2_seq'::regclass)
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Distributed randomly

gpadmin=# grant SELECT,usage on sequence ns1.t1_c2_seq to ro_user;
GRANT

gpadmin=# select * from pg_class where relname='t1_c2_seq';
  relname  | relnamespace | reltype | relowner | relam | relfilenode | 
reltablespace | relpages | reltuples | reltoast
relid | reltoastidxid | relaosegrelid | relaosegidxid | relhasindex | 
relisshared | relkind | relstorage | relnatts | 
relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | 
relhaspkey | relhasrules | relhassubclass | rel
frozenxid |  relacl  | reloptions 
---+--+-+--+---+-+---+--+---+-
--+---+---+---+-+-+-++--+-
--+-+--+--+-+++-++
--+--+
 t1_c2_seq |17638 |   17650 |   10 | 0 |   17649 |  
   0 |1 | 1 | 
0 | 0 | 0 | 0 | f   | f 
  | S   | h  |9 | 
0 |   0 |0 |0 |   0 | f  | f
  | f   | f  |
0 | {gpadmin=rwU/gpadmin,ro_user=rU/gpadmin} | 
(1 row)

gpadmin=# insert into ns1.t1(c1) values('abc');
INSERT 0 1
gpadmin=# select * from ns1.t1;
 c1  | c2 
-+
 abc |  3
(1 row)
{code}

2. Connect to database as user with readonly access and run SELECT query 
against the table. It will fail with "permission denied" error
{code:java}
[gpadmin@hdm1 ~]$ psql -U ro_user -d gpadmin
psql (8.2.15)
Type "help" for help.

gpadmin=> select * from ns1.t1;
ERROR:  permission denied for sequence t1_c2_seq
{code}

3. Grant the ALL privilege on the sequence to that user, which makes it able to 
SELECT data from the table

{code:java}
[gpadmin@hdm1 ~]$ psql
gpadmin-# psql (8.2.15)
gpadmin-# Type "help" for help.
gpadmin-# 
gpadmin=# grant update on sequence ns1.t1_c2_seq to ro_user;
GRANT
gpadmin=# select * from pg_class where relname='t1_c2_seq';
  relname  | relnamespace | reltype | relowner | relam | relfilenode | 
reltablespace | relpages | reltuples | reltoast
relid | reltoastidxid | relaosegrelid | relaosegidxid | relhasindex | 
relisshared | relkind | relstorage | relnatts | 
relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | 
relhaspkey | relhasrules | relhassubclass | rel
frozenxid |  relacl   | reloptions 
---+--+-+--+---+-+---+--+---+-
--+---+---+---+-+-+-++--+-
--+-+--+--+-+++-++
--+---+
 t1_c2_seq |17638 |   17650 |   10 | 0 |   17649 |  
   0 |1 | 1 | 
0 | 0 | 0 | 0 | f   | f 
  | S   | h  |9 | 
0 |   0 |0 |0 |   0 | f  | f
  | f   | f  |
0 | {gpadmin=rwU/gpadmin,ro_user=rwU/gpadmin} | 
(1 row)

gpadmin=# \q
[gpadmin@hdm1 ~]$ psql -U ro_user -d gpadmin
psql (8.2.15)
Type "help" 

[jira] [Commented] (HAWQ-1068) master process panic with signal 11 when call get_ao_compression_ratio(null)

2016-09-22 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15512444#comment-15512444
 ] 

Ming LI commented on HAWQ-1068:
---

Added checks for this function:

postgres=# select get_ao_compression_ratio(null);
ERROR:  failed to get relname for this function. (aosegfiles.c:1553)
postgres=# select get_ao_compression_ratio(0);
ERROR:  failed to get valid relation id for this function. (aosegfiles.c:1575)
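
For illustration, a minimal sketch of this kind of guard in the style of a 
PostgreSQL fmgr V1 function; the function name, error wording, and body below 
are assumptions, not the actual aosegfiles.c change:

{code}
#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(get_ao_compression_ratio_checked);

Datum
get_ao_compression_ratio_checked(PG_FUNCTION_ARGS)
{
    text   *relname;

    /* Reject a NULL argument before it is ever dereferenced. */
    if (PG_ARGISNULL(0))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("failed to get relname for this function")));

    relname = PG_GETARG_TEXT_P(0);
    (void) relname;     /* real code resolves the OID from this name and
                         * raises an error if the OID is invalid */

    PG_RETURN_FLOAT8(1.0);
}
{code}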

> master process panic with signal 11 when call get_ao_compression_ratio(null)
> 
>
> Key: HAWQ-1068
> URL: https://issues.apache.org/jira/browse/HAWQ-1068
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Ming LI
>Assignee: Lei Chang
> Fix For: backlog
>
>
> A customer has a function that calls get_ao_compression_ratio(), passing the 
> relation name dynamically. However, in some corner cases 
> get_ao_compression_ratio() is passed NULL, which triggers a signal 11 and 
> crashes the master process. 
> The issue can easily be reproduced in house as shown below:
> gpadmin=# select get_ao_compression_ratio(null);
> server closed the connection unexpectedly
>   This probably means the server terminated abnormally
>   before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !> \q



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HAWQ-1068) master process panic with signal 11 when call get_ao_compression_ratio(null)

2016-09-22 Thread Ming LI (JIRA)
Ming LI created HAWQ-1068:
-

 Summary: master process panic with signal 11 when call 
get_ao_compression_ratio(null)
 Key: HAWQ-1068
 URL: https://issues.apache.org/jira/browse/HAWQ-1068
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Catalog
Reporter: Ming LI
Assignee: Lei Chang
 Fix For: backlog


A customer has a function that calls get_ao_compression_ratio(), passing the 
relation name dynamically. However, in some corner cases 
get_ao_compression_ratio() is passed NULL, which triggers a signal 11 and 
crashes the master process. 
The issue can easily be reproduced in house as shown below:
gpadmin=# select get_ao_compression_ratio(null);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HAWQ-1030) User hang due to poor spin-lock/LWLock performance under high concurrency

2016-08-29 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1030.
---
Resolution: Fixed

> User hang due to poor spin-lock/LWLock performance under high concurrency
> -
>
> Key: HAWQ-1030
> URL: https://issues.apache.org/jira/browse/HAWQ-1030
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: 2.0.1.0-incubating
>
>
> Some clients have recently reported apparent hangs with their applications. 
> In all cases the symptoms were the same:
> * All sessions appear to be hung in LWLockAcquire or Release, specifically 
> s_lock
> * there is a high number of concurrent sessions (close to 100)
> * System is not actually hung, normally processing resumes after some period 
> of time when all sessions have completed their locking work
> The postgresql developer community has found several issues with performance 
> under high concurrency (> 32 sessions) in the spin-lock mechanism we've 
> inherited in HAWQ. This ultimately has been corrected in 9.5 with a 
> replacement to the spin-lock mechanism and appears to provide a significant 
> boost to query performance.
> The actual fix is in commit: ab5194e6f617a9a9e7aadb3dd1cee948a42d0755
> Only 1 line commit to s_lock.c could help address this and would be easy 
> enough to cherry-pick: b03d196be055450c7260749f17347c2d066b4254



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HAWQ-1030) User hang due to poor spin-lock/LWLock performance under high concurrency

2016-08-29 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447862#comment-15447862
 ] 

Ming LI commented on HAWQ-1030:
---

HAWQ did not change the spin-lock implementation itself, which is the same as 
PostgreSQL's. With porting from newer PostgreSQL versions in mind, we keep the 
code as close as possible to the latest PostgreSQL (version 9.6).

> User hang due to poor spin-lock/LWLock performance under high concurrency
> -
>
> Key: HAWQ-1030
> URL: https://issues.apache.org/jira/browse/HAWQ-1030
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: 2.0.1.0-incubating
>
>
> Some clients have recently reported apparent hangs with their applications. 
> In all cases the symptoms were the same:
> * All sessions appear to be hung in LWLockAcquire or Release, specifically 
> s_lock
> * there is a high number of concurrent sessions (close to 100)
> * System is not actually hung, normally processing resumes after some period 
> of time when all sessions have completed their locking work
> The postgresql developer community has found several issues with performance 
> under high concurrency (> 32 sessions) in the spin-lock mechanism we've 
> inherited in HAWQ. This ultimately has been corrected in 9.5 with a 
> replacement to the spin-lock mechanism and appears to provide a significant 
> boost to query performance.
> The actual fix is in commit: ab5194e6f617a9a9e7aadb3dd1cee948a42d0755
> Only 1 line commit to s_lock.c could help address this and would be easy 
> enough to cherry-pick: b03d196be055450c7260749f17347c2d066b4254



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HAWQ-1030) User hang due to poor spin-lock/LWLock performance under high concurrency

2016-08-29 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1030:
-

Assignee: Ming LI  (was: Lei Chang)

> User hang due to poor spin-lock/LWLock performance under high concurrency
> -
>
> Key: HAWQ-1030
> URL: https://issues.apache.org/jira/browse/HAWQ-1030
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Core
>Reporter: Ming LI
>Assignee: Ming LI
>
> Some clients have recently reported apparent hangs with their applications. 
> In all cases the symptoms were the same:
> * All sessions appear to be hung in LWLockAcquire or Release, specifically 
> s_lock
> * there is a high number of concurrent sessions (close to 100)
> * System is not actually hung, normally processing resumes after some period 
> of time when all sessions have completed their locking work
> The postgresql developer community has found several issues with performance 
> under high concurrency (> 32 sessions) in the spin-lock mechanism we've 
> inherited in HAWQ. This ultimately has been corrected in 9.5 with a 
> replacement to the spin-lock mechanism and appears to provide a significant 
> boost to query performance.
> The actual fix is in commit: ab5194e6f617a9a9e7aadb3dd1cee948a42d0755
> Only 1 line commit to s_lock.c could help address this and would be easy 
> enough to cherry-pick: b03d196be055450c7260749f17347c2d066b4254



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HAWQ-1030) User hang due to poor spin-lock/LWLock performance under high concurrency

2016-08-29 Thread Ming LI (JIRA)
Ming LI created HAWQ-1030:
-

 Summary: User hang due to poor spin-lock/LWLock performance under 
high concurrency
 Key: HAWQ-1030
 URL: https://issues.apache.org/jira/browse/HAWQ-1030
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Core
Reporter: Ming LI
Assignee: Lei Chang


Some clients have recently reported apparent hangs with their applications. In 
all cases the symptoms were the same:

* All sessions appear to be hung in LWLockAcquire or Release, specifically 
s_lock
* There is a high number of concurrent sessions (close to 100)
* The system is not actually hung; processing normally resumes after some period of 
time, once all sessions have completed their locking work

The PostgreSQL developer community has found several performance issues under 
high concurrency (> 32 sessions) in the spin-lock mechanism we've inherited in 
HAWQ. This was ultimately corrected in 9.5 with a replacement of the spin-lock 
mechanism, which appears to provide a significant boost to query performance.

The actual fix is in commit: ab5194e6f617a9a9e7aadb3dd1cee948a42d0755

A single one-line commit to s_lock.c could also help address this and would be 
easy enough to cherry-pick: b03d196be055450c7260749f17347c2d066b4254
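
To show where the contention comes from, here is a minimal self-contained sketch 
of the classic test-and-set spin loop with progressive back-off, roughly the 
shape of s_lock; it is not the cherry-picked patch and every name in it is 
illustrative:

{code}
#include <stdatomic.h>
#include <unistd.h>

typedef atomic_flag slock_t;

/* Spin until the lock is acquired, backing off progressively so that
 * dozens of waiters do not keep hammering the same cache line. */
static void
s_lock_sketch(slock_t *lock)
{
    unsigned delay_us = 1;

    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
    {
        usleep(delay_us);
        if (delay_us < 1000)
            delay_us *= 2;
    }
}

static void
s_unlock_sketch(slock_t *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

int
main(void)
{
    static slock_t lock = ATOMIC_FLAG_INIT;

    s_lock_sketch(&lock);
    /* ... critical section protected by the spin lock ... */
    s_unlock_sketch(&lock);
    return 0;
}
{code}

With close to 100 concurrent sessions, many backends spin in a loop like this at 
once, which matches the s_lock frames seen in the hung sessions.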



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HAWQ-1000) Set dummy workfile pointer to NULL after calling ExecWorkFile_Close()

2016-08-11 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-1000.
---
Resolution: Fixed

> Set dummy workfile pointer to NULL after calling ExecWorkFile_Close()
> -
>
> Key: HAWQ-1000
> URL: https://issues.apache.org/jira/browse/HAWQ-1000
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: 2.0.1.0-incubating
>
>
> The workfile parameter of ExecWorkFile_Close() is freed inside that function, 
> but in the caller the pointer variable still exists. We need to set it to NULL 
> immediately after the call; otherwise later code may use the freed pointer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HAWQ-1000) Set dummy workfile pointer to NULL after calling ExecWorkFile_Close()

2016-08-11 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1000:
-

Assignee: Ming LI  (was: Lei Chang)

> Set dummy workfile pointer to NULL after calling ExecWorkFile_Close()
> -
>
> Key: HAWQ-1000
> URL: https://issues.apache.org/jira/browse/HAWQ-1000
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Ming LI
>Assignee: Ming LI
>
> The workfile parameter of ExecWorkFile_Close() is freed inside that function, 
> but in the caller the pointer variable still exists. We need to set it to NULL 
> immediately after the call; otherwise later code may use the freed pointer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HAWQ-1000) Set dummy workfile pointer to NULL after calling ExecWorkFile_Close()

2016-08-11 Thread Ming LI (JIRA)
Ming LI created HAWQ-1000:
-

 Summary: Set dummy workfile pointer to NULL after calling 
ExecWorkFile_Close()
 Key: HAWQ-1000
 URL: https://issues.apache.org/jira/browse/HAWQ-1000
 Project: Apache HAWQ
  Issue Type: Bug
Reporter: Ming LI
Assignee: Lei Chang


The workfile parameter of ExecWorkFile_Close() is freed inside that function, 
but in the caller the pointer variable still exists. We need to set it to NULL 
immediately after the call; otherwise later code may use the freed pointer.
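
A minimal self-contained sketch of the pattern; the WorkFile type and 
workfile_close() below merely stand in for the real ExecWorkFile API:

{code}
#include <stdio.h>
#include <stdlib.h>

typedef struct WorkFile { FILE *fp; } WorkFile;

/* Stand-in for ExecWorkFile_Close(): closes and frees the object. */
static void
workfile_close(WorkFile *wf)
{
    if (wf->fp)
        fclose(wf->fp);
    free(wf);
}

int
main(void)
{
    WorkFile *wf = calloc(1, sizeof(WorkFile));
    if (wf == NULL)
        return 1;
    wf->fp = tmpfile();

    workfile_close(wf);
    wf = NULL;              /* the missing step: drop the dangling pointer */

    if (wf == NULL)         /* later code can now test the pointer safely */
        printf("workfile already closed\n");
    return 0;
}
{code}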




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HAWQ-994) PL/R UDF need to be separated from postgres process for robustness

2016-08-10 Thread Ming LI (JIRA)
Ming LI created HAWQ-994:


 Summary: PL/R UDF need to be separated from postgres process for 
robustness
 Key: HAWQ-994
 URL: https://issues.apache.org/jira/browse/HAWQ-994
 Project: Apache HAWQ
  Issue Type: New Feature
Reporter: Ming LI
Assignee: Lei Chang


Background:
With a previous single-node DB, users always deployed test code on a separate 
test DB. Now the data maintained in HAWQ has grown enormously, so it is hard to 
deploy a test HAWQ cluster with the same data.

As a result, users need to run test UDFs, or deploy UDFs that have not been 
tested against the whole data set, directly on HAWQ in the production 
environment, and these may crash inside PL/R or the R code. Sometimes a poorly 
written query leads to a postmaster reset, causing all running jobs to be 
cancelled and rolled back. Customers often see this as a HAWQ issue even when it 
is a user-code issue. So we need to separate PL/R execution from the postgres 
process and change the inter-process communication from shared memory to 
something else (e.g. pipe, socket, and so on).
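
A minimal sketch of the separation idea, assuming a simple fork-and-pipe model; 
the names and toy protocol are illustrative, not a proposed HAWQ API. The 
untrusted UDF body runs in a child process, so a crash there cannot bring down 
the backend:

{code}
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    int fds[2];

    if (pipe(fds) != 0)
        return 1;

    pid_t pid = fork();
    if (pid == 0)                        /* child: "PL/R worker" side */
    {
        close(fds[0]);
        const char *result = "42";       /* pretend the R code computed this */
        write(fds[1], result, strlen(result) + 1);
        _exit(0);                        /* a crash here only kills the child */
    }

    /* parent: backend side */
    close(fds[1]);
    char buf[64] = {0};
    read(fds[0], buf, sizeof(buf) - 1);

    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        fprintf(stderr, "UDF worker crashed; only this query needs to fail\n");
    else
        printf("UDF result: %s\n", buf);
    return 0;
}
{code}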




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HAWQ-978) long running query got hang on master and can't be terminated

2016-08-03 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-978.
--
Resolution: Fixed

> long running query got hang on master and can't be terminated
> -
>
> Key: HAWQ-978
> URL: https://issues.apache.org/jira/browse/HAWQ-978
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Ming LI
>Assignee: Ming LI
> Fix For: 2.0.1.0-incubating
>
>
> One backend process on master had been running for several days and can't be 
> terminated.
> The session is idle on all segments but master instance.
> pstack/strace/back trace of the backend process.
> {code}
> [gpadmin@alpmdwgp1prd ~]$ pstack 423984
> Thread 2 (Thread 0x7f0457844700 (LWP 424026)):
> #0  0x7f04756670d3 in poll () from /lib64/libc.so.6
> #1  0x00b90114 in rxThreadFunc ()
> #2  0x7f0475e889d1 in start_thread () from /lib64/libpthread.so.0
> #3  0x7f04756708fd in clone () from /lib64/libc.so.6
> Thread 1 (Thread 0x7f047862b720 (LWP 423984)):
> #0  0x7f047568005e in __lll_lock_wait_private () from /lib64/libc.so.6
> #1  0x7f0475604dc0 in _L_lock_5199 () from /lib64/libc.so.6
> #2  0x7f047560071b in _int_free () from /lib64/libc.so.6
> #3  0x00b1be91 in gp_free2 ()
> #4  0x00b10acc in AllocSetDelete ()
> #5  0x00b1468b in MemoryContextDeleteImpl ()
> #6  0x00aaf0f1 in RelationDestroyRelation ()
> #7  0x00ab60f2 in RelationCacheInvalidate ()
> #8  0x00aa9453 in InvalidateSystemCaches ()
> #9  0x00937eeb in ReceiveSharedInvalidMessages ()
> #10 0x0093c295 in LockRelationOid ()
> #11 0x004d8afd in heap_open ()
> #12 0x00aa46d4 in SearchCatCache ()
> #13 0x005c6512 in caql_getnext ()
> #14 0x00749153 in sql_exec_error_callback ()
> #15 0x00ad6e5a in errfinish ()
> #16 0x00ad8ed9 in elog_finish ()
> #17 0x00944e6b in handle_sig_alarm ()
> #18 
> #19 0x7f047560168f in _int_malloc () from /lib64/libc.so.6
> #20 0x7f04756026b1 in malloc () from /lib64/libc.so.6
> #21 0x00b1c2c1 in gp_malloc ()
> #22 0x00b1259c in AllocSetAlloc ()
> #23 0x00b15f5d in MemoryContextAllocZeroImpl ()
> #24 0x00b6cb4f in initMotionLayerStructs ()
> #25 0x007275e0 in ExecutorStart ()
> #26 0x00749a2e in fmgr_sql ()
> #27 0x0072e316 in ExecMakeFunctionResultNoSets ()
> #28 0x0072e129 in ExecMakeFunctionResultNoSets ()
> #29 0x00733312 in ExecProject ()
> #30 0x007602c7 in ExecHashJoin ()
> #31 0x0072ca84 in ExecProcNode ()
> #32 0x0076bf38 in ExecSort ()
> #33 0x0072caa6 in ExecProcNode ()
> #34 0x0072199c in ExecutePlan ()
> #35 0x007221a8 in ExecutorRun ()
> #36 0x00971e09 in PortalRun ()
> #37 0x00966968 in exec_simple_query ()
> #38 0x00969ab9 in PostgresMain ()
> #39 0x008c707e in ServerLoop ()
> #40 0x008c9e20 in PostmasterMain ()
> #41 0x007c85af in main ()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HAWQ-978) long running query got hang on master and can't be terminated

2016-08-03 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405589#comment-15405589
 ] 

Ming LI commented on HAWQ-978:
--

The best way to avoid this kind of deadlock is to call only asynchronous-safe 
functions within signal handlers:
https://www.securecoding.cert.org/confluence/display/c/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers
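
A minimal sketch of that pattern with illustrative names (this is not the HAWQ 
handle_sig_alarm code): the handler only sets a flag, and anything that may 
allocate memory or take locks runs later in the main loop:

{code}
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t alarm_pending = 0;

/* Safe: touches only a sig_atomic_t flag -- no malloc, no elog, no locks. */
static void
sig_alarm_handler(int signo)
{
    (void) signo;
    alarm_pending = 1;
}

int
main(void)
{
    signal(SIGALRM, sig_alarm_handler);
    alarm(1);

    for (;;)
    {
        pause();                 /* wait for a signal */
        if (alarm_pending)
        {
            alarm_pending = 0;
            /* malloc/free, logging, cache invalidation, etc. are safe here,
             * outside the handler, so the libc heap lock cannot deadlock. */
            printf("statement timeout reached, cancelling query\n");
            break;
        }
    }
    return 0;
}
{code}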

> long running query got hang on master and can't be terminated
> -
>
> Key: HAWQ-978
> URL: https://issues.apache.org/jira/browse/HAWQ-978
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Ming LI
>Assignee: Ming LI
>
> One backend process on master had been running for several days and can't be 
> terminated.
> The session is idle on all segments but master instance.
> pstack/strace/back trace of the backend process.
> {code}
> [gpadmin@alpmdwgp1prd ~]$ pstack 423984
> Thread 2 (Thread 0x7f0457844700 (LWP 424026)):
> #0  0x7f04756670d3 in poll () from /lib64/libc.so.6
> #1  0x00b90114 in rxThreadFunc ()
> #2  0x7f0475e889d1 in start_thread () from /lib64/libpthread.so.0
> #3  0x7f04756708fd in clone () from /lib64/libc.so.6
> Thread 1 (Thread 0x7f047862b720 (LWP 423984)):
> #0  0x7f047568005e in __lll_lock_wait_private () from /lib64/libc.so.6
> #1  0x7f0475604dc0 in _L_lock_5199 () from /lib64/libc.so.6
> #2  0x7f047560071b in _int_free () from /lib64/libc.so.6
> #3  0x00b1be91 in gp_free2 ()
> #4  0x00b10acc in AllocSetDelete ()
> #5  0x00b1468b in MemoryContextDeleteImpl ()
> #6  0x00aaf0f1 in RelationDestroyRelation ()
> #7  0x00ab60f2 in RelationCacheInvalidate ()
> #8  0x00aa9453 in InvalidateSystemCaches ()
> #9  0x00937eeb in ReceiveSharedInvalidMessages ()
> #10 0x0093c295 in LockRelationOid ()
> #11 0x004d8afd in heap_open ()
> #12 0x00aa46d4 in SearchCatCache ()
> #13 0x005c6512 in caql_getnext ()
> #14 0x00749153 in sql_exec_error_callback ()
> #15 0x00ad6e5a in errfinish ()
> #16 0x00ad8ed9 in elog_finish ()
> #17 0x00944e6b in handle_sig_alarm ()
> #18 
> #19 0x7f047560168f in _int_malloc () from /lib64/libc.so.6
> #20 0x7f04756026b1 in malloc () from /lib64/libc.so.6
> #21 0x00b1c2c1 in gp_malloc ()
> #22 0x00b1259c in AllocSetAlloc ()
> #23 0x00b15f5d in MemoryContextAllocZeroImpl ()
> #24 0x00b6cb4f in initMotionLayerStructs ()
> #25 0x007275e0 in ExecutorStart ()
> #26 0x00749a2e in fmgr_sql ()
> #27 0x0072e316 in ExecMakeFunctionResultNoSets ()
> #28 0x0072e129 in ExecMakeFunctionResultNoSets ()
> #29 0x00733312 in ExecProject ()
> #30 0x007602c7 in ExecHashJoin ()
> #31 0x0072ca84 in ExecProcNode ()
> #32 0x0076bf38 in ExecSort ()
> #33 0x0072caa6 in ExecProcNode ()
> #34 0x0072199c in ExecutePlan ()
> #35 0x007221a8 in ExecutorRun ()
> #36 0x00971e09 in PortalRun ()
> #37 0x00966968 in exec_simple_query ()
> #38 0x00969ab9 in PostgresMain ()
> #39 0x008c707e in ServerLoop ()
> #40 0x008c9e20 in PostmasterMain ()
> #41 0x007c85af in main ()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HAWQ-978) long running query got hang on master and can't be terminated

2016-08-03 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-978:


Assignee: Ming LI  (was: Lei Chang)

> long running query got hang on master and can't be terminated
> -
>
> Key: HAWQ-978
> URL: https://issues.apache.org/jira/browse/HAWQ-978
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Ming LI
>Assignee: Ming LI
>
> One backend process on master had been running for several days and can't be 
> terminated.
> The session is idle on all segments but master instance.
> pstack/strace/back trace of the backend process.
> {code}
> [gpadmin@alpmdwgp1prd ~]$ pstack 423984
> Thread 2 (Thread 0x7f0457844700 (LWP 424026)):
> #0  0x7f04756670d3 in poll () from /lib64/libc.so.6
> #1  0x00b90114 in rxThreadFunc ()
> #2  0x7f0475e889d1 in start_thread () from /lib64/libpthread.so.0
> #3  0x7f04756708fd in clone () from /lib64/libc.so.6
> Thread 1 (Thread 0x7f047862b720 (LWP 423984)):
> #0  0x7f047568005e in __lll_lock_wait_private () from /lib64/libc.so.6
> #1  0x7f0475604dc0 in _L_lock_5199 () from /lib64/libc.so.6
> #2  0x7f047560071b in _int_free () from /lib64/libc.so.6
> #3  0x00b1be91 in gp_free2 ()
> #4  0x00b10acc in AllocSetDelete ()
> #5  0x00b1468b in MemoryContextDeleteImpl ()
> #6  0x00aaf0f1 in RelationDestroyRelation ()
> #7  0x00ab60f2 in RelationCacheInvalidate ()
> #8  0x00aa9453 in InvalidateSystemCaches ()
> #9  0x00937eeb in ReceiveSharedInvalidMessages ()
> #10 0x0093c295 in LockRelationOid ()
> #11 0x004d8afd in heap_open ()
> #12 0x00aa46d4 in SearchCatCache ()
> #13 0x005c6512 in caql_getnext ()
> #14 0x00749153 in sql_exec_error_callback ()
> #15 0x00ad6e5a in errfinish ()
> #16 0x00ad8ed9 in elog_finish ()
> #17 0x00944e6b in handle_sig_alarm ()
> #18 
> #19 0x7f047560168f in _int_malloc () from /lib64/libc.so.6
> #20 0x7f04756026b1 in malloc () from /lib64/libc.so.6
> #21 0x00b1c2c1 in gp_malloc ()
> #22 0x00b1259c in AllocSetAlloc ()
> #23 0x00b15f5d in MemoryContextAllocZeroImpl ()
> #24 0x00b6cb4f in initMotionLayerStructs ()
> #25 0x007275e0 in ExecutorStart ()
> #26 0x00749a2e in fmgr_sql ()
> #27 0x0072e316 in ExecMakeFunctionResultNoSets ()
> #28 0x0072e129 in ExecMakeFunctionResultNoSets ()
> #29 0x00733312 in ExecProject ()
> #30 0x007602c7 in ExecHashJoin ()
> #31 0x0072ca84 in ExecProcNode ()
> #32 0x0076bf38 in ExecSort ()
> #33 0x0072caa6 in ExecProcNode ()
> #34 0x0072199c in ExecutePlan ()
> #35 0x007221a8 in ExecutorRun ()
> #36 0x00971e09 in PortalRun ()
> #37 0x00966968 in exec_simple_query ()
> #38 0x00969ab9 in PostgresMain ()
> #39 0x008c707e in ServerLoop ()
> #40 0x008c9e20 in PostmasterMain ()
> #41 0x007c85af in main ()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HAWQ-978) long running query got hang on master and can't be terminated

2016-08-03 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI updated HAWQ-978:
-
Description: 
One backend process on master had been running for several days and can't be 
terminated.
The session is idle on all segments but master instance.

pstack/strace/back trace of the backend process.

{code}
[gpadmin@alpmdwgp1prd ~]$ pstack 423984
Thread 2 (Thread 0x7f0457844700 (LWP 424026)):
#0  0x7f04756670d3 in poll () from /lib64/libc.so.6
#1  0x00b90114 in rxThreadFunc ()
#2  0x7f0475e889d1 in start_thread () from /lib64/libpthread.so.0
#3  0x7f04756708fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f047862b720 (LWP 423984)):
#0  0x7f047568005e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x7f0475604dc0 in _L_lock_5199 () from /lib64/libc.so.6
#2  0x7f047560071b in _int_free () from /lib64/libc.so.6
#3  0x00b1be91 in gp_free2 ()
#4  0x00b10acc in AllocSetDelete ()
#5  0x00b1468b in MemoryContextDeleteImpl ()
#6  0x00aaf0f1 in RelationDestroyRelation ()
#7  0x00ab60f2 in RelationCacheInvalidate ()
#8  0x00aa9453 in InvalidateSystemCaches ()
#9  0x00937eeb in ReceiveSharedInvalidMessages ()
#10 0x0093c295 in LockRelationOid ()
#11 0x004d8afd in heap_open ()
#12 0x00aa46d4 in SearchCatCache ()
#13 0x005c6512 in caql_getnext ()
#14 0x00749153 in sql_exec_error_callback ()
#15 0x00ad6e5a in errfinish ()
#16 0x00ad8ed9 in elog_finish ()
#17 0x00944e6b in handle_sig_alarm ()
#18 
#19 0x7f047560168f in _int_malloc () from /lib64/libc.so.6
#20 0x7f04756026b1 in malloc () from /lib64/libc.so.6
#21 0x00b1c2c1 in gp_malloc ()
#22 0x00b1259c in AllocSetAlloc ()
#23 0x00b15f5d in MemoryContextAllocZeroImpl ()
#24 0x00b6cb4f in initMotionLayerStructs ()
#25 0x007275e0 in ExecutorStart ()
#26 0x00749a2e in fmgr_sql ()
#27 0x0072e316 in ExecMakeFunctionResultNoSets ()
#28 0x0072e129 in ExecMakeFunctionResultNoSets ()
#29 0x00733312 in ExecProject ()
#30 0x007602c7 in ExecHashJoin ()
#31 0x0072ca84 in ExecProcNode ()
#32 0x0076bf38 in ExecSort ()
#33 0x0072caa6 in ExecProcNode ()
#34 0x0072199c in ExecutePlan ()
#35 0x007221a8 in ExecutorRun ()
#36 0x00971e09 in PortalRun ()
#37 0x00966968 in exec_simple_query ()
#38 0x00969ab9 in PostgresMain ()
#39 0x008c707e in ServerLoop ()
#40 0x008c9e20 in PostmasterMain ()
#41 0x007c85af in main ()
{code}

  was:
One backend process on master had been running for several days and can't be 
terminated.
The session is idle on all segments but master instance.

pstack/strace/back trace of the backend process.

{code}
[gpadmin@avw7hdm2p1 ~]$ pstack 431263
Thread 2 (Thread 0x7f4c93aa2700 (LWP 431264)):
#0  0x7f4c9013f0d3 in poll () from /lib64/libc.so.6
#1  0x00ba8294 in rxThreadFunc ()
#2  0x7f4c9101f9d1 in start_thread () from /lib64/libpthread.so.0
#3  0x7f4c901488fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f4c93af48e0 (LWP 431263)):
#0  0x7f4c9015805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x7f4c900dd16b in _L_lock_9503 () from /lib64/libc.so.6
#2  0x7f4c900da6a6 in malloc () from /lib64/libc.so.6
#3  0x7f4c9008fb39 in _nl_make_l10nflist () from /lib64/libc.so.6
#4  0x7f4c9008ddf5 in _nl_find_domain () from /lib64/libc.so.6
#5  0x7f4c9008d6e0 in __dcigettext () from /lib64/libc.so.6
#6  0x7f4c6fabcfe3 in Rf_onsigusr1 () from /usr/local/lib64/R/lib/libR.so
#7  
#8  0x7f4c9014079a in brk () from /lib64/libc.so.6
#9  0x7f4c90140845 in sbrk () from /lib64/libc.so.6
#10 0x7f4c900dd769 in __default_morecore () from /lib64/libc.so.6
#11 0x7f4c900d87a2 in _int_free () from /lib64/libc.so.6
#12 0x00b3ff24 in gp_free2 ()
#13 0x00b356fc in AllocSetDelete ()
#14 0x00b38391 in MemoryContextDeleteImpl ()
#15 0x0077c851 in ExecEndAgg ()
#16 0x007592ad in ExecEndNode ()
#17 0x0075186c in ExecEndPlan ()
#18 0x0079dffa in ExecEndSubqueryScan ()
#19 0x0075921d in ExecEndNode ()
#20 0x0075186c in ExecEndPlan ()
#21 0x00752565 in ExecutorEnd ()
#22 0x006dd9bd in PortalCleanup ()
#23 0x00b3f077 in AtCommit_Portals ()
#24 0x0051abe5 in CommitTransaction ()
#25 0x0051f1d5 in CommitTransactionCommand ()
#26 0x0099809e in PostgresMain ()
#27 0x008f1031 in BackendStartup ()
#28 0x008f70e0 in PostmasterMain ()
#29 0x007f63da in main ()
[gpadmin@avw7hdm2p1 ~]$


[gpadmin@avw7hdm2p1 ~]$ strace -p 431263
Process 431263 attached - interrupt to quit
futex(0x7f4c903efe80, FUTEX_WAIT_PRIVATE, 2, NULL^C 
Process 431263 detached
[gpadmin@avw7hdm2p1 

[jira] [Updated] (HAWQ-978) long running query got hang on master and can't be terminated

2016-08-03 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI updated HAWQ-978:
-
Description: 
One backend process on master had been running for several days and can't be 
terminated.
The session is idle on all segments but master instance.

pstack/strace/back trace of the backend process.

{code}
[gpadmin@avw7hdm2p1 ~]$ pstack 431263
Thread 2 (Thread 0x7f4c93aa2700 (LWP 431264)):
#0  0x7f4c9013f0d3 in poll () from /lib64/libc.so.6
#1  0x00ba8294 in rxThreadFunc ()
#2  0x7f4c9101f9d1 in start_thread () from /lib64/libpthread.so.0
#3  0x7f4c901488fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f4c93af48e0 (LWP 431263)):
#0  0x7f4c9015805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x7f4c900dd16b in _L_lock_9503 () from /lib64/libc.so.6
#2  0x7f4c900da6a6 in malloc () from /lib64/libc.so.6
#3  0x7f4c9008fb39 in _nl_make_l10nflist () from /lib64/libc.so.6
#4  0x7f4c9008ddf5 in _nl_find_domain () from /lib64/libc.so.6
#5  0x7f4c9008d6e0 in __dcigettext () from /lib64/libc.so.6
#6  0x7f4c6fabcfe3 in Rf_onsigusr1 () from /usr/local/lib64/R/lib/libR.so
#7  
#8  0x7f4c9014079a in brk () from /lib64/libc.so.6
#9  0x7f4c90140845 in sbrk () from /lib64/libc.so.6
#10 0x7f4c900dd769 in __default_morecore () from /lib64/libc.so.6
#11 0x7f4c900d87a2 in _int_free () from /lib64/libc.so.6
#12 0x00b3ff24 in gp_free2 ()
#13 0x00b356fc in AllocSetDelete ()
#14 0x00b38391 in MemoryContextDeleteImpl ()
#15 0x0077c851 in ExecEndAgg ()
#16 0x007592ad in ExecEndNode ()
#17 0x0075186c in ExecEndPlan ()
#18 0x0079dffa in ExecEndSubqueryScan ()
#19 0x0075921d in ExecEndNode ()
#20 0x0075186c in ExecEndPlan ()
#21 0x00752565 in ExecutorEnd ()
#22 0x006dd9bd in PortalCleanup ()
#23 0x00b3f077 in AtCommit_Portals ()
#24 0x0051abe5 in CommitTransaction ()
#25 0x0051f1d5 in CommitTransactionCommand ()
#26 0x0099809e in PostgresMain ()
#27 0x008f1031 in BackendStartup ()
#28 0x008f70e0 in PostmasterMain ()
#29 0x007f63da in main ()
[gpadmin@avw7hdm2p1 ~]$


[gpadmin@avw7hdm2p1 ~]$ strace -p 431263
Process 431263 attached - interrupt to quit
futex(0x7f4c903efe80, FUTEX_WAIT_PRIVATE, 2, NULL^C 
Process 431263 detached
[gpadmin@avw7hdm2p1 ~]$



(gdb) thread apply all bt

Thread 2 (Thread 0x7f4c93af48e0 (LWP 431263)):
#0  0x7f4c9015805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x7f4c900dd16b in _L_lock_9503 () from /lib64/libc.so.6
#2  0x7f4c900da6a6 in malloc () from /lib64/libc.so.6
#3  0x7f4c9008fb39 in _nl_make_l10nflist () from /lib64/libc.so.6
#4  0x7f4c9008ddf5 in _nl_find_domain () from /lib64/libc.so.6
#5  0x7f4c9008d6e0 in __dcigettext () from /lib64/libc.so.6
#6  0x7f4c6fabcfe3 in Rf_onsigusr1 (dummy=) at 
errors.c:178
#7  
#8  0x7f4c9014079a in brk () from /lib64/libc.so.6
#9  0x7f4c90140845 in sbrk () from /lib64/libc.so.6
#10 0x7f4c900dd769 in __default_morecore () from /lib64/libc.so.6
#11 0x7f4c900d87a2 in _int_free () from /lib64/libc.so.6
#12 0x00b3ff24 in gp_free2 (ptr=0x191c3b000, sz=0) at memprot.c:808
#13 0x00b356fc in AllocSetDelete (context=) at 
aset.c:981
#14 0x00b38391 in MemoryContextDeleteImpl (context=0x4a46da0, 
sfile=0x0, func=, sline=-1) at mcxt.c:232
#15 MemoryContextDeleteChildren (context=0x4a46da0, sfile=0x0, func=, sline=-1) at mcxt.c:251
#16 MemoryContextDeleteImpl (context=0x4a46da0, sfile=0x0, func=, sline=-1) at mcxt.c:205
#17 0x0077c851 in ExecEndAgg (node=0x325eb00) at nodeAgg.c:2641
#18 0x007592ad in ExecEndNode (node=0x325eb00) at execProcnode.c:1687
#19 0x0075186c in ExecEndPlan (planstate=0x325eb00, estate=0x323f9e8) 
at execMain.c:2825
#20 0x0079dffa in ExecEndSubqueryScan (node=0x325cd20) at 
nodeSubqueryscan.c:294
#21 0x0075921d in ExecEndNode (node=0x325cd20) at execProcnode.c:1638
#22 0x0075186c in ExecEndPlan (planstate=0x325cd20, estate=0x323f010) 
at execMain.c:2825
#23 0x00752565 in ExecutorEnd (queryDesc=) at 
execMain.c:1321
#24 0x006dd9bd in PortalCleanupHelper (portal=) at 
portalcmds.c:366
#25 PortalCleanup (portal=) at portalcmds.c:302
#26 0x00b3f077 in PortalDrop () at portalmem.c:402
#27 AtCommit_Portals () at portalmem.c:643
#28 0x0051abe5 in CommitTransaction () at xact.c:3379
#29 0x0051f1d5 in CommitTransactionCommand () at xact.c:4535
#30 0x0099809e in finish_xact_command (argc=, 
argv=, username=) at postgres.c:3180
#31 PostgresMain (argc=, argv=, 
username=) at postgres.c:5260
#32 0x008f1031 in BackendRun (port=0x2aa5520) at postmaster.c:6811
#33 BackendStartup (port=0x2aa5520) at postmaster.c:6408
#34 0x008f70e0 in ServerLoop (argc=, argv=) at 

[jira] [Updated] (HAWQ-978) long running query got hang on master and can't be terminated

2016-08-03 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI updated HAWQ-978:
-
Description: 
One backend process on master had been running for several days and can't be 
terminated.
The session is idle on all segments but master instance.

pstack/strace/back trace of the backend process.


[gpadmin@avw7hdm2p1 ~]$ pstack 431263
Thread 2 (Thread 0x7f4c93aa2700 (LWP 431264)):
#0  0x7f4c9013f0d3 in poll () from /lib64/libc.so.6
#1  0x00ba8294 in rxThreadFunc ()
#2  0x7f4c9101f9d1 in start_thread () from /lib64/libpthread.so.0
#3  0x7f4c901488fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f4c93af48e0 (LWP 431263)):
#0  0x7f4c9015805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x7f4c900dd16b in _L_lock_9503 () from /lib64/libc.so.6
#2  0x7f4c900da6a6 in malloc () from /lib64/libc.so.6
#3  0x7f4c9008fb39 in _nl_make_l10nflist () from /lib64/libc.so.6
#4  0x7f4c9008ddf5 in _nl_find_domain () from /lib64/libc.so.6
#5  0x7f4c9008d6e0 in __dcigettext () from /lib64/libc.so.6
#6  0x7f4c6fabcfe3 in Rf_onsigusr1 () from /usr/local/lib64/R/lib/libR.so
#7  
#8  0x7f4c9014079a in brk () from /lib64/libc.so.6
#9  0x7f4c90140845 in sbrk () from /lib64/libc.so.6
#10 0x7f4c900dd769 in __default_morecore () from /lib64/libc.so.6
#11 0x7f4c900d87a2 in _int_free () from /lib64/libc.so.6
#12 0x00b3ff24 in gp_free2 ()
#13 0x00b356fc in AllocSetDelete ()
#14 0x00b38391 in MemoryContextDeleteImpl ()
#15 0x0077c851 in ExecEndAgg ()
#16 0x007592ad in ExecEndNode ()
#17 0x0075186c in ExecEndPlan ()
#18 0x0079dffa in ExecEndSubqueryScan ()
#19 0x0075921d in ExecEndNode ()
#20 0x0075186c in ExecEndPlan ()
#21 0x00752565 in ExecutorEnd ()
#22 0x006dd9bd in PortalCleanup ()
#23 0x00b3f077 in AtCommit_Portals ()
#24 0x0051abe5 in CommitTransaction ()
#25 0x0051f1d5 in CommitTransactionCommand ()
#26 0x0099809e in PostgresMain ()
#27 0x008f1031 in BackendStartup ()
#28 0x008f70e0 in PostmasterMain ()
#29 0x007f63da in main ()
[gpadmin@avw7hdm2p1 ~]$


[gpadmin@avw7hdm2p1 ~]$ strace -p 431263
Process 431263 attached - interrupt to quit
futex(0x7f4c903efe80, FUTEX_WAIT_PRIVATE, 2, NULL^C 
Process 431263 detached
[gpadmin@avw7hdm2p1 ~]$



(gdb) thread apply all bt

Thread 2 (Thread 0x7f4c93af48e0 (LWP 431263)):
#0  0x7f4c9015805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x7f4c900dd16b in _L_lock_9503 () from /lib64/libc.so.6
#2  0x7f4c900da6a6 in malloc () from /lib64/libc.so.6
#3  0x7f4c9008fb39 in _nl_make_l10nflist () from /lib64/libc.so.6
#4  0x7f4c9008ddf5 in _nl_find_domain () from /lib64/libc.so.6
#5  0x7f4c9008d6e0 in __dcigettext () from /lib64/libc.so.6
#6  0x7f4c6fabcfe3 in Rf_onsigusr1 (dummy=) at 
errors.c:178
#7  
#8  0x7f4c9014079a in brk () from /lib64/libc.so.6
#9  0x7f4c90140845 in sbrk () from /lib64/libc.so.6
#10 0x7f4c900dd769 in __default_morecore () from /lib64/libc.so.6
#11 0x7f4c900d87a2 in _int_free () from /lib64/libc.so.6
#12 0x00b3ff24 in gp_free2 (ptr=0x191c3b000, sz=0) at memprot.c:808
#13 0x00b356fc in AllocSetDelete (context=) at 
aset.c:981
#14 0x00b38391 in MemoryContextDeleteImpl (context=0x4a46da0, 
sfile=0x0, func=, sline=-1) at mcxt.c:232
#15 MemoryContextDeleteChildren (context=0x4a46da0, sfile=0x0, func=, sline=-1) at mcxt.c:251
#16 MemoryContextDeleteImpl (context=0x4a46da0, sfile=0x0, func=, sline=-1) at mcxt.c:205
#17 0x0077c851 in ExecEndAgg (node=0x325eb00) at nodeAgg.c:2641
#18 0x007592ad in ExecEndNode (node=0x325eb00) at execProcnode.c:1687
#19 0x0075186c in ExecEndPlan (planstate=0x325eb00, estate=0x323f9e8) 
at execMain.c:2825
#20 0x0079dffa in ExecEndSubqueryScan (node=0x325cd20) at 
nodeSubqueryscan.c:294
#21 0x0075921d in ExecEndNode (node=0x325cd20) at execProcnode.c:1638
#22 0x0075186c in ExecEndPlan (planstate=0x325cd20, estate=0x323f010) 
at execMain.c:2825
#23 0x00752565 in ExecutorEnd (queryDesc=<value optimized out>) at execMain.c:1321
#24 0x006dd9bd in PortalCleanupHelper (portal=<value optimized out>) at portalcmds.c:366
#25 PortalCleanup (portal=<value optimized out>) at portalcmds.c:302
#26 0x00b3f077 in PortalDrop () at portalmem.c:402
#27 AtCommit_Portals () at portalmem.c:643
#28 0x0051abe5 in CommitTransaction () at xact.c:3379
#29 0x0051f1d5 in CommitTransactionCommand () at xact.c:4535
#30 0x0099809e in finish_xact_command (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:3180
#31 PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:5260
#32 0x008f1031 in BackendRun (port=0x2aa5520) at postmaster.c:6811
#33 BackendStartup (port=0x2aa5520) at postmaster.c:6408
#34 0x008f70e0 in ServerLoop (argc=, argv=) at 

[jira] [Created] (HAWQ-978) long running query got hang on master and can't be terminated

2016-08-03 Thread Ming LI (JIRA)
Ming LI created HAWQ-978:


 Summary: long running query got hang on master and can't be 
terminated
 Key: HAWQ-978
 URL: https://issues.apache.org/jira/browse/HAWQ-978
 Project: Apache HAWQ
  Issue Type: Bug
Reporter: Ming LI
Assignee: Lei Chang


One backend process on the master had been running for several days and could 
not be terminated.
The session is idle on all segments except the master instance.

pstack/strace/gdb backtrace of the backend process:

```
[gpadmin@avw7hdm2p1 ~]$ pstack 431263
Thread 2 (Thread 0x7f4c93aa2700 (LWP 431264)):
#0  0x7f4c9013f0d3 in poll () from /lib64/libc.so.6
#1  0x00ba8294 in rxThreadFunc ()
#2  0x7f4c9101f9d1 in start_thread () from /lib64/libpthread.so.0
#3  0x7f4c901488fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f4c93af48e0 (LWP 431263)):
#0  0x7f4c9015805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x7f4c900dd16b in _L_lock_9503 () from /lib64/libc.so.6
#2  0x7f4c900da6a6 in malloc () from /lib64/libc.so.6
#3  0x7f4c9008fb39 in _nl_make_l10nflist () from /lib64/libc.so.6
#4  0x7f4c9008ddf5 in _nl_find_domain () from /lib64/libc.so.6
#5  0x7f4c9008d6e0 in __dcigettext () from /lib64/libc.so.6
#6  0x7f4c6fabcfe3 in Rf_onsigusr1 () from /usr/local/lib64/R/lib/libR.so
#7  <signal handler called>
#8  0x7f4c9014079a in brk () from /lib64/libc.so.6
#9  0x7f4c90140845 in sbrk () from /lib64/libc.so.6
#10 0x7f4c900dd769 in __default_morecore () from /lib64/libc.so.6
#11 0x7f4c900d87a2 in _int_free () from /lib64/libc.so.6
#12 0x00b3ff24 in gp_free2 ()
#13 0x00b356fc in AllocSetDelete ()
#14 0x00b38391 in MemoryContextDeleteImpl ()
#15 0x0077c851 in ExecEndAgg ()
#16 0x007592ad in ExecEndNode ()
#17 0x0075186c in ExecEndPlan ()
#18 0x0079dffa in ExecEndSubqueryScan ()
#19 0x0075921d in ExecEndNode ()
#20 0x0075186c in ExecEndPlan ()
#21 0x00752565 in ExecutorEnd ()
#22 0x006dd9bd in PortalCleanup ()
#23 0x00b3f077 in AtCommit_Portals ()
#24 0x0051abe5 in CommitTransaction ()
#25 0x0051f1d5 in CommitTransactionCommand ()
#26 0x0099809e in PostgresMain ()
#27 0x008f1031 in BackendStartup ()
#28 0x008f70e0 in PostmasterMain ()
#29 0x007f63da in main ()
[gpadmin@avw7hdm2p1 ~]$


[gpadmin@avw7hdm2p1 ~]$ strace -p 431263
Process 431263 attached - interrupt to quit
futex(0x7f4c903efe80, FUTEX_WAIT_PRIVATE, 2, NULL^C 
Process 431263 detached
[gpadmin@avw7hdm2p1 ~]$



(gdb) thread apply all bt

Thread 2 (Thread 0x7f4c93af48e0 (LWP 431263)):
#0  0x7f4c9015805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x7f4c900dd16b in _L_lock_9503 () from /lib64/libc.so.6
#2  0x7f4c900da6a6 in malloc () from /lib64/libc.so.6
#3  0x7f4c9008fb39 in _nl_make_l10nflist () from /lib64/libc.so.6
#4  0x7f4c9008ddf5 in _nl_find_domain () from /lib64/libc.so.6
#5  0x7f4c9008d6e0 in __dcigettext () from /lib64/libc.so.6
#6  0x7f4c6fabcfe3 in Rf_onsigusr1 (dummy=<value optimized out>) at errors.c:178
#7  <signal handler called>
#8  0x7f4c9014079a in brk () from /lib64/libc.so.6
#9  0x7f4c90140845 in sbrk () from /lib64/libc.so.6
#10 0x7f4c900dd769 in __default_morecore () from /lib64/libc.so.6
#11 0x7f4c900d87a2 in _int_free () from /lib64/libc.so.6
#12 0x00b3ff24 in gp_free2 (ptr=0x191c3b000, sz=0) at memprot.c:808
#13 0x00b356fc in AllocSetDelete (context=<value optimized out>) at aset.c:981
#14 0x00b38391 in MemoryContextDeleteImpl (context=0x4a46da0, sfile=0x0, func=<value optimized out>, sline=-1) at mcxt.c:232
#15 MemoryContextDeleteChildren (context=0x4a46da0, sfile=0x0, func=<value optimized out>, sline=-1) at mcxt.c:251
#16 MemoryContextDeleteImpl (context=0x4a46da0, sfile=0x0, func=<value optimized out>, sline=-1) at mcxt.c:205
#17 0x0077c851 in ExecEndAgg (node=0x325eb00) at nodeAgg.c:2641
#18 0x007592ad in ExecEndNode (node=0x325eb00) at execProcnode.c:1687
#19 0x0075186c in ExecEndPlan (planstate=0x325eb00, estate=0x323f9e8) 
at execMain.c:2825
#20 0x0079dffa in ExecEndSubqueryScan (node=0x325cd20) at 
nodeSubqueryscan.c:294
#21 0x0075921d in ExecEndNode (node=0x325cd20) at execProcnode.c:1638
#22 0x0075186c in ExecEndPlan (planstate=0x325cd20, estate=0x323f010) 
at execMain.c:2825
#23 0x00752565 in ExecutorEnd (queryDesc=<value optimized out>) at execMain.c:1321
#24 0x006dd9bd in PortalCleanupHelper (portal=<value optimized out>) at portalcmds.c:366
#25 PortalCleanup (portal=<value optimized out>) at portalcmds.c:302
#26 0x00b3f077 in PortalDrop () at portalmem.c:402
#27 AtCommit_Portals () at portalmem.c:643
#28 0x0051abe5 in CommitTransaction () at xact.c:3379
#29 0x0051f1d5 in CommitTransactionCommand () at xact.c:4535
#30 0x0099809e in finish_xact_command (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:3180
#31 PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:5260
#32 0x008f1031 in 

[jira] [Updated] (HAWQ-925) Set default locale, timezone & datastyle before running sql command/file

2016-07-15 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI updated HAWQ-925:
-
Fix Version/s: (was: 2.0.0.0-incubating)
   2.0.1.0-incubating

> Set default locale, timezone & datastyle before running sql command/file
> 
>
> Key: HAWQ-925
> URL: https://issues.apache.org/jira/browse/HAWQ-925
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Paul Guo
>Assignee: Paul Guo
> Fix For: 2.0.1.0-incubating
>
>
> So that sql output could be consistent.
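
One way to pin these settings, sketched in Python under the assumption that the test harness shells out to psql (PGOPTIONS is the standard libpq mechanism for per-session GUCs; the helper name run_sql_file is illustrative, not actual HAWQ code):

{code}
import os
import subprocess

def run_sql_file(sql_file, dbname="postgres"):
    # Pin locale-sensitive GUCs for this session so the output is reproducible.
    env = dict(os.environ)
    env["PGOPTIONS"] = "-c datestyle=ISO,MDY -c timezone=UTC -c lc_messages=C"
    subprocess.check_call(["psql", "-d", dbname, "-f", sql_file], env=env)
{code}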



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HAWQ-901) hawq init failed: hawqstandbywatch.py:test5:gpadmin-[WARNING]:-syncmaster not running

2016-07-06 Thread Ming LI (JIRA)
Ming LI created HAWQ-901:


 Summary: hawq init failed: 
hawqstandbywatch.py:test5:gpadmin-[WARNING]:-syncmaster not running
 Key: HAWQ-901
 URL: https://issues.apache.org/jira/browse/HAWQ-901
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Command Line Tools
Reporter: Ming LI
Assignee: Lei Chang


Error message in ~/hawqAdminLogs/hawq_init_.log

20160706:06:45:53:006218 hawq_start:test1:gpadmin-[INFO]:-Start hawq with args: 
['start', 'standby']
20160706:06:45:53:006218 hawq_start:test1:gpadmin-[INFO]:-Gathering information 
and validating the environment...
20160706:06:45:53:006218 hawq_start:test1:gpadmin-[INFO]:-Start standby master 
service
20160706:06:46:02:006218 hawq_start:test1:gpadmin-[INFO]:-Checking standby 
master status
20160706:06:45:55:004418 hawqstandbywatch.py:test5:gpadmin-[INFO]:-Monitoring 
logs
20160706:06:46:00:004418 hawqstandbywatch.py:test5:gpadmin-[INFO]:-checking if 
syncmaster is running
20160706:06:46:02:004418 
hawqstandbywatch.py:test5:gpadmin-[WARNING]:-syncmaster not running
20160706:06:46:02:006218 hawq_start:test1:gpadmin-[ERROR]:-Standby master start 
failed, exit
20160706:06:46:02:003999 hawqinit.sh:test5:gpadmin-[ERROR]:-Start HAWQ standby 
failed
--

(1) I suspect the root cause may be that we only wait 5 seconds before checking 
the standby's running status, and this interval is too small. Could you please 
first change the standby status check from a fixed 5-second wait to a retry 
loop, similar to the recovery status check on the master? (A sketch follows 
below.)

(2) If the 'syncmaster not running' error leads to init failure, the message 
should be raised from [WARNING] to [ERROR].
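
A minimal sketch of such a retry loop, assuming the check lives in the Python start scripts; the function name check_syncmaster_running and the 60-second budget are illustrative, not the actual HAWQ code:

{code}
import time

def wait_for_syncmaster(check_syncmaster_running, timeout=60, interval=5):
    """Poll until the syncmaster reports running, rather than checking once
    after a fixed 5-second wait. Returns True on success, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check_syncmaster_running():
            return True
        time.sleep(interval)
    return False
{code}

With this shape, hawq init only fails after the timeout expires instead of after a single early probe.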





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HAWQ-812) Activate standby master failed after create a new database

2016-07-03 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI resolved HAWQ-812.
--
Resolution: Fixed
  Assignee: Ming LI  (was: Lei Chang)

> Activate standby master failed after create a new database
> --
>
> Key: HAWQ-812
> URL: https://issues.apache.org/jira/browse/HAWQ-812
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Chunling Wang
>Assignee: Ming LI
>
> Activating the standby master fails after creating a new database. However, it 
> succeeds if we do not create a new database, even if we create a new table and 
> insert data.
> 1. Create a new database 'gptest'
> {code}
> [gpadmin@test1 ~]$ psql -l
>  List of databases
>Name|  Owner  | Encoding | Access privileges
> ---+-+--+---
>  postgres  | gpadmin | UTF8 |
>  template0 | gpadmin | UTF8 |
>  template1 | gpadmin | UTF8 |
> (3 rows)
> [gpadmin@test1 ~]$ createdb gptest
> [gpadmin@test1 ~]$ psql -l
>  List of databases
>Name|  Owner  | Encoding | Access privileges
> ---+-+--+---
>  gptest| gpadmin | UTF8 |
>  postgres  | gpadmin | UTF8 |
>  template0 | gpadmin | UTF8 |
>  template1 | gpadmin | UTF8 |
> (4 rows)
> {code}
> 2. Stop HAWQ master
> {code}
> [gpadmin@test1 ~]$ hawq stop master -a
> 20160613:20:13:44:068559 hawq_stop:test1:gpadmin-[INFO]:-Prepare to do 'hawq 
> stop'
> 20160613:20:13:44:068559 hawq_stop:test1:gpadmin-[INFO]:-You can find log in:
> 20160613:20:13:44:068559 
> hawq_stop:test1:gpadmin-[INFO]:-/home/gpadmin/hawqAdminLogs/hawq_stop_20160613.log
> 20160613:20:13:44:068559 hawq_stop:test1:gpadmin-[INFO]:-GPHOME is set to:
> 20160613:20:13:44:068559 
> hawq_stop:test1:gpadmin-[INFO]:-/data/pulse-agent-data/HAWQ-main-FeatureTest-opt-mutilnodeparallel-wcl/product/hawq/.
> 20160613:20:13:44:068559 hawq_stop:test1:gpadmin-[INFO]:-Stop hawq with args: 
> ['stop', 'master']
> 20160613:20:13:45:068559 hawq_stop:test1:gpadmin-[INFO]:-There are 0 
> connections to the database
> 20160613:20:13:45:068559 hawq_stop:test1:gpadmin-[INFO]:-Commencing Master 
> instance shutdown with mode='smart'
> 20160613:20:13:45:068559 hawq_stop:test1:gpadmin-[INFO]:-Master host=test1
> 20160613:20:13:45:068559 hawq_stop:test1:gpadmin-[INFO]:-Stop hawq master
> 20160613:20:13:46:068559 hawq_stop:test1:gpadmin-[INFO]:-Master stopped 
> successfully
> {code}
> 3. Activate standby master
> {code}
> [gpadmin@test1 ~]$ ssh test5 'source 
> /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-mutilnodeparallel-wcl/product/hawq/./greenplum_path.sh;
>  hawq activate standby -a'
> 20160613:20:14:14:126841 hawq_activate:test5:gpadmin-[INFO]:-Prepare to do 
> 'hawq activate'
> 20160613:20:14:14:126841 hawq_activate:test5:gpadmin-[INFO]:-You can find log 
> in:
> 20160613:20:14:14:126841 
> hawq_activate:test5:gpadmin-[INFO]:-/home/gpadmin/hawqAdminLogs/hawq_activate_20160613.log
> 20160613:20:14:14:126841 hawq_activate:test5:gpadmin-[INFO]:-GPHOME is set to:
> 20160613:20:14:14:126841 
> hawq_activate:test5:gpadmin-[INFO]:-/data/pulse-agent-data/HAWQ-main-FeatureTest-opt-mutilnodeparallel-wcl/product/hawq/.
> 20160613:20:14:14:126841 hawq_activate:test5:gpadmin-[INFO]:-Activate hawq 
> with args: ['activate', 'standby']
> 20160613:20:14:14:126841 hawq_activate:test5:gpadmin-[INFO]:-Starting to 
> activate standby master 'test5'
> 20160613:20:14:15:126841 hawq_activate:test5:gpadmin-[INFO]:-HAWQ master is 
> not running, skip
> 20160613:20:14:15:126841 hawq_activate:test5:gpadmin-[INFO]:-Stopping all the 
> running segments
> 20160613:20:14:21:126841 hawq_activate:test5:gpadmin-[INFO]:-
> 20160613:20:14:21:126841 hawq_activate:test5:gpadmin-[INFO]:-Stopping running 
> standby
> 20160613:20:14:23:126841 hawq_activate:test5:gpadmin-[INFO]:-Update master 
> host name in hawq-site.xml
> 20160613:20:14:31:126841 hawq_activate:test5:gpadmin-[INFO]:-GUC 
> hawq_master_address_host already exist in hawq-site.xml
> Update it with value: test5
> 20160613:20:14:31:126841 hawq_activate:test5:gpadmin-[INFO]:-Remove current 
> standby from hawq-site.xml
> 20160613:20:14:39:126841 hawq_activate:test5:gpadmin-[INFO]:-Start master in 
> master only mode
> {code}
> It hangs and can not start master. And the master log is following:
> {code}
> 2016-06-13 20:14:40.268022 
> PDT,,,p127518,th-12124628160,,,seg-1,"LOG","0","database 
> system was shut down at 2016-06-13 20:02:50 PDT",,,0,,"xlog.c",6205,
> 2016-06-13 20:14:40.268112 
> PDT,,,p127518,th-12124628160,,,seg-1,"LOG","0","found 
> recovery.conf file indicating standby takeover recovery 
> needed",,,0,,"xlog.c",5485,
> 2016-06-13 20:14:40.268131 
> 
