[jira] Commented: (PIG-1029) HBaseStorage is way too slow to be usable

2010-02-11 Thread Vincent BARAT (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832707#action_12832707
 ] 

Vincent BARAT commented on PIG-1029:


OK, I've found the parameter. Nevertheless, as I stated previously, "even if 
this cache size can be configured globally using configuration files, I think 
HBaseStorage() should take an additional (perhaps optional) parameter to set 
the cache size for the scanned table."

Don't you think so? Is it worth making this available in the HBaseStorage() 
call? My point is that some tables have very large rows and others very small 
ones, so a single hbase.client.scanner.caching value set at the config-file 
level cannot suit all of them, and a way to set it at the Pig level would be 
very useful.
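
To illustrate why one global value does not fit every table, here is a rough 
sketch. The helper cachingFor() and the 2 MiB batch budget are assumptions made 
up for this example; neither is part of Pig or HBase. It derives a per-table 
caching value from an assumed average row size:

```java
public class ScannerCachingSketch {
    // Hypothetical helper (not a Pig or HBase API): picks how many rows to
    // fetch per scanner call so that one batch stays within budgetBytes,
    // given an assumed average row size.
    public static int cachingFor(long avgRowBytes, long budgetBytes) {
        if (avgRowBytes <= 0 || budgetBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long rows = budgetBytes / avgRowBytes;
        // Never go below 1 row per call, and stay within int range.
        return (int) Math.max(1L, Math.min(rows, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long budget = 2L * 1024 * 1024; // assumed budget of 2 MiB per batch
        System.out.println("small rows (200 B): " + cachingFor(200, budget));
        System.out.println("large rows (1 MiB): " + cachingFor(1024 * 1024, budget));
    }
}
```

With these assumed numbers, a table of 200-byte rows would get a caching value 
in the thousands, while a table of 1 MiB rows would get a value of 2.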

> HBaseStorage is way too slow to be usable
> -
>
> Key: PIG-1029
> URL: https://issues.apache.org/jira/browse/PIG-1029
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.4.0
> Reporter: Vincent BARAT
>
> I have run a set of benchmarks on the HBaseStorage loader, using Pig 0.4.0, 
> HBase 0.20.0 (with the patch referred to in 
> https://issues.apache.org/jira/browse/PIG-970) and Hadoop 0.20.0.
> The HBaseStorage loader is roughly 10x slower than the PigStorage loader.
> To work around this limitation, I had to read my HBase tables, write them to 
> a Hadoop file, and then use this file as input for my subsequent computations.
> I am reporting this bug for the record; I will try to see if I can optimise 
> this a bit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1029) HBaseStorage is way too slow to be usable

2010-02-11 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832574#action_12832574
 ] 

Jeff Zhang commented on PIG-1029:
-

Vincent, you can change the caching size by setting 
hbase.client.scanner.caching in hbase-site.xml.
If it is not set, the default value is 1.
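
For reference, such an entry in hbase-site.xml would look roughly like the 
fragment below. The value 1000 is only an example taken from this thread, not 
a recommendation; a suitable value depends on the row size of the tables being 
scanned.

```xml
<!-- Inside the <configuration> element of hbase-site.xml. -->
<property>
  <name>hbase.client.scanner.caching</name>
  <!-- Example value only; tune to the row size of your tables. -->
  <value>1000</value>
</property>
```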






[jira] Commented: (PIG-1029) HBaseStorage is way too slow to be usable

2010-02-10 Thread Vincent BARAT (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832075#action_12832075
 ] 

Vincent BARAT commented on PIG-1029:


OK, I got the answer: the HBase scanner used to load the HBase table uses the 
default HBase caching policy (see HTable.setScannerCaching()).
For me it is set to 1 (and I don't know whether I can change this using HBase 
config files). If I set it to, say, 1000 by modifying HBaseSlicer(), the load 
is 10x faster.
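
As a back-of-the-envelope illustration of why the caching value matters, here 
is a simplified model. It assumes each scanner fetch returns up to `caching` 
rows; the class name and the row count of one million are made up for this 
example. It counts fetches for a full scan:

```java
public class ScanRoundTrips {
    // Simplified model: each scanner fetch returns up to `caching` rows,
    // so a full scan of `rows` rows needs ceil(rows / caching) fetches.
    public static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // assumed table size, for illustration only
        System.out.println("caching=1:    " + roundTrips(rows, 1) + " fetches");
        System.out.println("caching=1000: " + roundTrips(rows, 1000) + " fetches");
    }
}
```

The model counts fetches only, not elapsed time. The measured gain here was 
about 10x rather than 1000x, which suggests that other costs also weigh on 
the total load time.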

Of course, the appropriate cache size depends on the size of the table rows, 
so it is not possible to hard-code a value in HBaseSlicer().

Even if this cache size can be configured globally using configuration files, I 
think HBaseStorage() should take an additional (perhaps optional) parameter to 
set the cache size for the scanned table.

What I propose, if you agree, is to write the patch and submit it for 
integration into Pig.




[jira] Commented: (PIG-1029) HBaseStorage is way too slow to be usable

2009-10-29 Thread Vincent BARAT (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771444#action_12771444
 ] 

Vincent BARAT commented on PIG-1029:


I have a small cluster of 1 master node + 3 MR nodes (virtual nodes) connected 
through a gigabit switch (the network connection is fast).
Each MR node also runs an HBase region server and ZooKeeper.

What I have noticed is that the HBase data is not always read from the local 
node (according to the Hadoop web frontend). Most of the time the data is read 
from another node.

Anyway, I don't think that the slowness comes from this. I suspect two things:

1) reading from HBase is just far slower than reading from a Hadoop file
2) converting HBase records to Pig tuples (which is done in the HBaseSlice 
object) is slow (it involves a lot of object instantiation)

Unfortunately, I have not run additional tests to figure out the exact cause.




[jira] Commented: (PIG-1029) HBaseStorage is way too slow to be usable

2009-10-28 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770875#action_12770875
 ] 

Jeff Zhang commented on PIG-1029:
-

Vincent, what environment did you use for the performance comparison?
