[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-11 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897820#comment-13897820
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

One more thing:
There is some versioning in class TableSplit (methods write & read).
Don't we need to increment it?
(I am just asking.)
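
To illustrate why a version number matters when a field is added to a serialized class, here is a minimal sketch. It is not TableSplit's actual wire format: the class name, the version constants and the field layout are all assumptions made up for this example. The reader uses the version to decide whether a length field is present and falls back to 0 for older data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class VersionedSplitSketch {
    // Hypothetical version numbers, not the values used by TableSplit.
    static final int VERSION_WITHOUT_LENGTH = 1;
    static final int VERSION_WITH_LENGTH = 2;

    // Serializes a split in the given format version; the length field
    // is written only from version 2 on.
    static byte[] write(int version, String location, long length) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(version);
            out.writeUTF(location);
            if (version >= VERSION_WITH_LENGTH) {
                out.writeLong(length);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the length back, falling back to 0 for data written before
    // the length field existed.
    static long readLength(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int version = in.readInt();
            in.readUTF(); // location, not needed in this sketch
            return version >= VERSION_WITH_LENGTH ? in.readLong() : 0L;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readLength(write(VERSION_WITH_LENGTH, "host1", 42L)));
        System.out.println(readLength(write(VERSION_WITHOUT_LENGTH, "host1", 42L)));
    }
}
```

Without the version check, a reader that always expects the extra long would fail on data written by an older writer.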

> Tablesplit.getLength returns 0
> --
>
> Key: HBASE-10413
> URL: https://issues.apache.org/jira/browse/HBASE-10413
> Project: HBase
>  Issue Type: Bug
>  Components: Client, mapreduce
>Affects Versions: 0.96.1.1
>Reporter: Lukas Nalezenec
>Assignee: Lukas Nalezenec
> Fix For: 0.98.1, 0.99.0
>
> Attachments: 10413-7.patch, HBASE-10413-2.patch, HBASE-10413-3.patch, 
> HBASE-10413-4.patch, HBASE-10413-5.patch, HBASE-10413-6.patch, 
> HBASE-10413.patch
>
>
> InputSplits should be sorted by length, but TableSplit does not contain a real
> getLength implementation:
>   @Override
>   public long getLength() {
>     // Not clear how to obtain this... seems to be used only for sorting splits
>     return 0;
>   }
> This is causing us problems with scheduling: we have jobs that are
> supposed to finish in a limited time, but they often get stuck in the last mapper
> working on a large region.
> Can we implement this method?
> What is the best way?
> We were thinking about estimating the size from the size of the files on HDFS.
> We would like to get the Scanner from TableSplit and use startRow, stopRow and
> column families to find the corresponding region, then compute the size on HDFS
> for the given region and column family.
> Update:
> This ticket was about a production issue. I talked with the guy who worked on it,
> and he said our production issue was probably not directly caused by
> getLength() returning 0.
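
The scheduling effect described in the issue can be sketched as follows. This is only an illustration with a hypothetical `Split` stand-in class, not Hadoop's or HBase's code: when every split reports length 0, a largest-first sort cannot move the big region to the front, so it may be started last.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SplitOrdering {
    // Hypothetical stand-in for an input split: a region name and its size in bytes.
    static final class Split {
        final String region;
        final long length;

        Split(String region, long length) {
            this.region = region;
            this.length = length;
        }
    }

    // Returns a copy of the splits ordered largest first, so that the
    // longest-running work can be started early.
    static List<Split> largestFirst(List<Split> splits) {
        List<Split> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparingLong((Split s) -> s.length).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
            new Split("a", 10L), new Split("b", 300L), new Split("c", 20L));
        for (Split s : largestFirst(splits)) {
            System.out.println(s.region + " " + s.length);
        }
    }
}
```

Because the sort is stable, splits that all report the same length (for example 0) keep their original order.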



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-11 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897700#comment-13897700
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

Hi, thank you very much for your time.

I need one small change. It's not critical, but it will make a considerable
difference in user experience.

My line
LOG.info(MessageFormat.format("Input split length: {0} bytes.",
tSplit.getLength()));
was changed to
LOG.info("Input split length: " + tSplit.getLength() + " bytes.");
in the last code review.

The reason why I used MessageFormat.format is that the length is a large number
and it needs to be printed with a thousands separator.

It takes a few seconds to read the number
54798765321
How fast can you tell whether the number represents 54 TB or 54 GB?

But if you print it with separators, you can read it correctly in a moment:
54,798,765,321

Can we add some formatting consistent with the HBase coding standards? Maybe
String.format, I don't know.

Lukas
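
For reference, a small sketch of the two formatting options mentioned above. The class and method names are made up for illustration, and the locale is pinned to `Locale.US` as an assumption so that the separator is always a comma; which variant fits the HBase coding standards is not decided here.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class SplitLengthFormat {
    // Uses the ',' flag of String.format to group digits.
    static String withStringFormat(long bytes) {
        return String.format(Locale.US, "Input split length: %,d bytes.", bytes);
    }

    // MessageFormat applies the locale's default number format, which
    // groups digits for numeric arguments.
    static String withMessageFormat(long bytes) {
        MessageFormat format = new MessageFormat("Input split length: {0} bytes.", Locale.US);
        return format.format(new Object[] { bytes });
    }

    public static void main(String[] args) {
        long length = 54798765321L;
        System.out.println(withStringFormat(length));
        System.out.println(withMessageFormat(length));
    }
}
```

Both calls produce "Input split length: 54,798,765,321 bytes." for the number used in the comment.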



[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-10 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896865#comment-13896865
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

It would be great.





[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-10 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896438#comment-13896438
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

I have removed setLength() from TableSplit.
Unit tests are green, I would like to resolve this ticket.



[jira] [Updated] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-10 Thread Lukas Nalezenec (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukas Nalezenec updated HBASE-10413:


Attachment: HBASE-10413-6.patch



[jira] [Updated] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-08 Thread Lukas Nalezenec (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukas Nalezenec updated HBASE-10413:


Attachment: HBASE-10413-5.patch

Fix after code review.
TableSplit still contains setLength().




[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-08 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895729#comment-13895729
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

> Let's make RegionSizeCalculator @InterfaceAudience.Private. Users are not
> expected to directly call this, right?
 - I am not sure. I have no experience with the InterfaceAudience
annotation. A lot of developers use a heavily customized
TableInputFormat, and they may want to use this class. I have changed it to
Private (btw: I was told to change it from Private to Public in the previous code
review).

> Instead of TableSplit.setLength(), you can overload the ctor. TableSplit acts
> like an immutable data bean.
 - It means there will be a ctor with 6 parameters. IMO that is too many, but if you
really want me to do it I will.

> In some cases, the regions might split or merge concurrently between getting
> the startEndKeys and asking the cluster for the regions. In this case, for that
> range, we might default to 0, but it should be ok I think. We are just
> estimating the region sizes here.
 - I think it's not worth doing. It will be rare and the difference will be
insignificant most of the time.
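
The constructor-versus-setter suggestion can be sketched like this. `SketchSplit` is a hypothetical, simplified class written only for this illustration (it has fewer fields than TableSplit): the length is passed through an additional constructor overload, so the object needs no setter and stays immutable.

```java
public final class SketchSplit {
    private final byte[] startRow;
    private final byte[] endRow;
    private final String regionLocation;
    private final long length;

    // Existing-style constructor: the length is unknown, so it defaults to 0.
    public SketchSplit(byte[] startRow, byte[] endRow, String regionLocation) {
        this(startRow, endRow, regionLocation, 0L);
    }

    // Overload that carries the estimated region size in bytes.
    public SketchSplit(byte[] startRow, byte[] endRow, String regionLocation, long length) {
        this.startRow = startRow.clone(); // defensive copies keep the object immutable
        this.endRow = endRow.clone();
        this.regionLocation = regionLocation;
        this.length = length;
    }

    public byte[] getStartRow() { return startRow.clone(); }
    public byte[] getEndRow() { return endRow.clone(); }
    public String getRegionLocation() { return regionLocation; }
    public long getLength() { return length; }

    public static void main(String[] args) {
        SketchSplit split = new SketchSplit(new byte[] {1}, new byte[] {9}, "host1", 1024L);
        System.out.println(split.getLength());
    }
}
```

The trade-off raised in the reply still applies: each new field adds another constructor parameter.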




[jira] [Updated] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-07 Thread Lukas Nalezenec (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukas Nalezenec updated HBASE-10413:


Attachment: HBASE-10413-4.patch

Calculator works only with store file size, not memstore size



[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-07 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894845#comment-13894845
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

ok, memstore size removed.



[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-07 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894790#comment-13894790
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

Re:

+  long regionSizeBytes = (memSize + fileSize) * megaByte;

Does memstore size have to be included?

I am not sure. What are the pros and cons?
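
Whichever sizes end up being included, the conversion from megabytes to bytes needs 64-bit arithmetic. A minimal sketch with hypothetical names, assuming the sizes arrive as `int` megabyte counts:

```java
public class RegionSizeUnits {
    static final long MEGABYTE = 1024L * 1024L;

    // Store files only; the int operand is widened to long before multiplying.
    static long storeFileBytes(int storefileSizeMB) {
        return storefileSizeMB * MEGABYTE;
    }

    // Store files plus memstore, for comparison.
    static long totalBytes(int memStoreSizeMB, int storefileSizeMB) {
        return ((long) memStoreSizeMB + storefileSizeMB) * MEGABYTE;
    }

    public static void main(String[] args) {
        int sizeMB = 4096;
        int overflowed = 1024 * 1024 * sizeMB; // int arithmetic wraps around
        System.out.println(overflowed);
        System.out.println(storeFileBytes(sizeMB));
        System.out.println(totalBytes(128, sizeMB));
    }
}
```

With pure `int` arithmetic a 4096 MB region wraps around to 0, which would silently bring back the original symptom.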



[jira] [Updated] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-07 Thread Lukas Nalezenec (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukas Nalezenec updated HBASE-10413:


Attachment: HBASE-10413-3.patch

code review



[jira] [Updated] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-07 Thread Lukas Nalezenec (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukas Nalezenec updated HBASE-10413:


Release Note: TableSplit.getLength() contains correct sizes of region in 
bytes. It is used by M/R framework for better scheduling.
  Status: Patch Available  (was: In Progress)



[jira] [Updated] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-07 Thread Lukas Nalezenec (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukas Nalezenec updated HBASE-10413:


Attachment: HBASE-10413-2.patch

latest patch with unit test category added



[jira] [Updated] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-07 Thread Lukas Nalezenec (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukas Nalezenec updated HBASE-10413:


Attachment: HBASE-10413.patch



[jira] [Work started] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-06 Thread Lukas Nalezenec (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-10413 started by Lukas Nalezenec.



[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-04 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890559#comment-13890559
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

New version with per table region filtering:
https://github.com/apache/hbase/pull/8/files#diff-46ff60f1e27e3d77131acb7873050990R76



[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-04 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890550#comment-13890550
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

Hi,
I know it is hacky. It is my first HBase commit and I was not sure how to do it, so
I asked three people and then published the first draft as soon as possible. Everybody
was fine with the solution :( .

The hacky solution is good enough for us; I already deployed it
yesterday. I can't spend much more time on this. I need to close it by tomorrow.

How about this solution? I am not sure if it is the best way, since it does not work
with Scan ranges.

ToDos:
- We need to filter regions by table.
- It would be nice if we could filter size by column families.


https://github.com/apache/hbase/pull/8/files#diff-46ff60f1e27e3d77131acb7873050990R68


HBaseAdmin admin = new HBaseAdmin(configuration);

ClusterStatus clusterStatus = admin.getClusterStatus();
Collection<ServerName> servers = clusterStatus.getServers();

for (ServerName serverName : servers) {
  ServerLoad serverLoad = clusterStatus.getLoad(serverName);

  for (Map.Entry<byte[], RegionLoad> regionEntry :
      serverLoad.getRegionsLoad().entrySet()) {
    byte[] regionId = regionEntry.getKey();
    RegionLoad regionLoad = regionEntry.getValue();

    // 1024L forces long arithmetic; the *SizeMB getters return int.
    long regionSize = 1024L * 1024L * (regionLoad.getMemStoreSizeMB() +
        regionLoad.getStorefileSizeMB());

    sizeMap.put(regionId, regionSize);
  }
}
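
One detail the snippet above leaves open is the type of `sizeMap`. Since the keys are `byte[]`, a plain HashMap would compare them by identity, not by content. The following self-contained sketch (hypothetical class name, plain JDK only, Java 9+ for `Arrays.compareUnsigned`) assumes a TreeMap with a byte-wise comparator and returns 0 for regions that are not in the map:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class RegionSizeLookup {
    // byte[] uses identity-based equals/hashCode, so the map is ordered by
    // content with an unsigned lexicographic comparator instead.
    private final Map<byte[], Long> sizeMap =
        new TreeMap<byte[], Long>(Arrays::compareUnsigned);

    void put(byte[] regionId, long sizeBytes) {
        sizeMap.put(regionId, sizeBytes);
    }

    // Unknown regions (for example after a concurrent split) report size 0.
    long getRegionSize(byte[] regionId) {
        Long size = sizeMap.get(regionId);
        return size == null ? 0L : size;
    }

    public static void main(String[] args) {
        RegionSizeLookup lookup = new RegionSizeLookup();
        lookup.put(new byte[] {1, 2}, 100L);
        System.out.println(lookup.getRegionSize(new byte[] {1, 2}));
        System.out.println(lookup.getRegionSize(new byte[] {9}));
    }
}
```

The lookup succeeds for a different array instance with the same content, which is what a caller holding a region name from another source needs.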



[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-02-03 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889593#comment-13889593
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

I made big changes to the code.
You can check and discuss it at https://github.com/apache/hbase/pull/8/files .

I have to write unit tests before making the patch.

- I need help with the unit test. Is there a simple unit test helper/utility I 
can use? I need to create a table with some regions and then work with their 
sizes. It should run locally, and there should be some level of abstraction. 

- I have added a configuration option for disabling this feature:
  Is there a policy about new configuration options? 
  Should I move the configuration key constant somewhere else? 
  Should the feature be disabled or enabled by default?

- Computation of region sizes might be slow. We might need some parallelization.
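
The last two points (a switch to disable the feature, and parallel size computation) could look roughly like the sketch below. It is not the code from the pull request: ParallelRegionSizes, the enabled flag, and the sizeLookup function are hypothetical stand-ins, and regions are identified by plain strings to keep it free of HBase classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

/** Hypothetical sketch: look up region sizes on a small thread pool. */
final class ParallelRegionSizes {

  /**
   * Returns a size per region, in input order. When the feature is disabled,
   * every region gets 0, which matches the old getLength() behaviour.
   */
  static Map<String, Long> compute(List<String> regions, ToLongFunction<String> sizeLookup,
      boolean enabled, int threads) {
    Map<String, Long> sizes = new LinkedHashMap<>();
    if (!enabled) {
      for (String region : regions) {
        sizes.put(region, 0L);
      }
      return sizes;
    }
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Submit one lookup per region, then collect results in input order.
      List<Future<Long>> pending = new ArrayList<>();
      for (String region : regions) {
        Callable<Long> task = () -> sizeLookup.applyAsLong(region);
        pending.add(pool.submit(task));
      }
      for (int i = 0; i < regions.size(); i++) {
        sizes.put(regions.get(i), pending.get(i).get());
      }
      return sizes;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while computing region sizes", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("region size lookup failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

Whether a thread pool is worth it depends on how the sizes are obtained; a single cluster status call that returns all regions at once would not need one.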

From mail:
+  public void setLength(long length) {
This method in TableSplit can be package private.

I think that a lot of people use TableSplit in their custom input formats. IMHO 
this method should be part of the API.
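
To make the point concrete, here is a minimal sketch of a split that carries its length, with a public setter so that a custom input format in another package can fill it in. SizedSplit is a hypothetical class, not HBase's TableSplit, and the serialization layout is an assumption made for this example only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical split that stores an estimated length in bytes. */
final class SizedSplit {
  private String regionLocation;
  private long length;

  SizedSplit(String regionLocation, long length) {
    this.regionLocation = regionLocation;
    this.length = length;
  }

  public String getRegionLocation() {
    return regionLocation;
  }

  public long getLength() {
    return length;
  }

  /** Public so that input formats outside this package can set the estimate. */
  public void setLength(long length) {
    this.length = length;
  }

  /** Writes the fields; the length must be included or it is lost after serialization. */
  void write(DataOutput out) throws IOException {
    out.writeUTF(regionLocation);
    out.writeLong(length);
  }

  /** Reads the fields in the same order as write(). */
  void readFields(DataInput in) throws IOException {
    regionLocation = in.readUTF();
    length = in.readLong();
  }

  /** Serializes and deserializes a split in memory, as a quick consistency check. */
  static SizedSplit roundTrip(SizedSplit split) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      split.write(new DataOutputStream(buffer));
      SizedSplit copy = new SizedSplit("", 0L);
      copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```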



[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-01-31 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887817#comment-13887817
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

first draft:
https://github.com/lukasnalezenec/hbase/commit/bf560b3c19b15cefb114132ac86664ffc44dad32



[jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0

2014-01-27 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882737#comment-13882737
 ] 

Lukas Nalezenec commented on HBASE-10413:
-

I talked with the guy who worked on this, and he said our production issue was 
probably not directly caused by getLength() returning 0. 
Anyway, we are interested in fixing this. 
See the updated ticket description.



[jira] [Updated] (HBASE-10413) Tablesplit.getLength returns 0

2014-01-27 Thread Lukas Nalezenec (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukas Nalezenec updated HBASE-10413:


Description: 
InputSplits should be sorted by length but TableSplit does not contain real 
getLength implementation:

  @Override
  public long getLength() {
// Not clear how to obtain this... seems to be used only for sorting splits
return 0;
  }

This is causing us problem with scheduling - we have got jobs that are supposed 
to finish in limited time but they get often stuck in last mapper working on 
large region.

Can we implement this method ? 
What is the best way ?

We were thinking about estimating size by size of files on HDFS.
We would like to get Scanner from TableSplit, use startRow, stopRow and column 
families to get corresponding region than computing size of HDFS for given 
region and column family. 


Update:
This ticket was about production issue - I talked with guy who worked on this 
and he said our production issue was probably not directly caused by 
getLength() returning 0. 

  was:
InputSplits should be sorted by length but TableSplit does not contain real 
getLength implementation:

  @Override
  public long getLength() {
// Not clear how to obtain this... seems to be used only for sorting splits
return 0;
  }

This is causing us problem with scheduling - we have got jobs that are supposed 
to finish in limited time but they get often stuck in last mapper working on 
large region.

Can we implement this method ? 
What is the best way ?

We were thinking about estimating size by size of files on HDFS.
We would like to get Scanner from TableSplit, use startRow, stopRow and column 
families to get corresponding region than computing size of HDFS for given 
region and column family. 


Update:
This ticket talked about production issue - I talked with guy who worked on 
this and he said our production issue was probably not directly caused by 
getLength() returning 0. 




[jira] [Updated] (HBASE-10413) Tablesplit.getLength returns 0

2014-01-27 Thread Lukas Nalezenec (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lukas Nalezenec updated HBASE-10413:


Description: 
InputSplits should be sorted by length but TableSplit does not contain real 
getLength implementation:

  @Override
  public long getLength() {
// Not clear how to obtain this... seems to be used only for sorting splits
return 0;
  }

This is causing us problem with scheduling - we have got jobs that are supposed 
to finish in limited time but they get often stuck in last mapper working on 
large region.

Can we implement this method ? 
What is the best way ?

We were thinking about estimating size by size of files on HDFS.
We would like to get Scanner from TableSplit, use startRow, stopRow and column 
families to get corresponding region than computing size of HDFS for given 
region and column family. 


Update:
This ticket talked about production issue - I talked with guy who worked on 
this and he said our production issue was probably not directly caused by 
getLength() returning 0. 

  was:
We had serious issue in our production today.

InputSplits should be sorted by length but TableSplit does not contain real 
getLength implementation:

  @Override
  public long getLength() {
// Not clear how to obtain this... seems to be used only for sorting splits
return 0;
  }

Can we implement this method ? 
What is the best way ?

Summary: Tablesplit.getLength returns 0  (was: TableSplits are not 
sorted by size.)



[jira] [Created] (HBASE-10413) TableSplits are not sorted by size.

2014-01-24 Thread Lukas Nalezenec (JIRA)
Lukas Nalezenec created HBASE-10413:
---

 Summary: TableSplits are not sorted by size.
 Key: HBASE-10413
 URL: https://issues.apache.org/jira/browse/HBASE-10413
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.96.1.1
Reporter: Lukas Nalezenec


We had a serious issue in production today.

InputSplits should be sorted by length, but TableSplit does not contain a real 
getLength implementation:

  @Override
  public long getLength() {
    // Not clear how to obtain this... seems to be used only for sorting splits
    return 0;
  }

Can we implement this method? 
What is the best way?
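
To illustrate why a constant 0 matters, here is a small sketch of "largest split first" ordering. It assumes the job submitter sorts splits by getLength() in descending order; SplitOrderSketch and its Split class are hypothetical stand-ins, not Hadoop or HBase classes. When every length is 0 the sort is a no-op, so a large region can end up scheduled last.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of size-based split ordering. */
final class SplitOrderSketch {

  /** Minimal stand-in for an input split: a name and a length in bytes. */
  static final class Split {
    final String name;
    final long length;

    Split(String name, long length) {
      this.name = name;
      this.length = length;
    }
  }

  /** Returns split names ordered largest first; equal lengths keep their input order. */
  static List<String> scheduleOrder(List<Split> splits) {
    List<Split> sorted = new ArrayList<>(splits);
    // List.sort is stable, so all-zero lengths leave the order unchanged.
    sorted.sort(Comparator.comparingLong((Split s) -> s.length).reversed());
    List<String> names = new ArrayList<>();
    for (Split split : sorted) {
      names.add(split.name);
    }
    return names;
  }
}
```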


