[jira] [Updated] (CASSANDRA-2843) better performance on long row read

2011-07-01 Thread Yang Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yang updated CASSANDRA-2843:
-

Attachment: fast_cf.diff

diff file

 better performance on long row read
 ---

 Key: CASSANDRA-2843
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2843
 Project: Cassandra
  Issue Type: New Feature
Reporter: Yang Yang
 Attachments: fast_cf.diff



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (CASSANDRA-2843) better performance on long row read

2011-07-01 Thread Yang Yang (JIRA)
better performance on long row read
---

 Key: CASSANDRA-2843
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2843
 Project: Cassandra
  Issue Type: New Feature
Reporter: Yang Yang
 Attachments: fast_cf.diff

currently if a row contains > 1000 columns, the read time becomes considerably 
slow: in my test, a row with 3000 columns (standard, regular), each with 8 bytes 
in name and 40 bytes in value, takes about 16ms to read.
this is all running in memory, no disk read is involved.

through debugging we can find most of this time is spent on:
[Wall Time]  org.apache.cassandra.db.Table.getRow(QueryFilter)
[Wall Time]  org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(QueryFilter, ColumnFamily)
[Wall Time]  org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(QueryFilter, int, ColumnFamily)
[Wall Time]  org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(QueryFilter, int, ColumnFamily)
[Wall Time]  org.apache.cassandra.db.filter.QueryFilter.collectCollatedColumns(ColumnFamily, Iterator, int)
[Wall Time]  org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(IColumnContainer, Iterator, int)
[Wall Time]  org.apache.cassandra.db.ColumnFamily.addColumn(IColumn)

ColumnFamily.addColumn() is slow because it inserts into an internal 
ConcurrentSkipListMap that maps column names to values.
this structure is slow for two reasons: it needs to do synchronization, and it 
needs to maintain the more complex structure of a map.

but if we look at the whole read path, thrift already defines the read output 
to be List<ColumnOrSuperColumn>, so it does not make sense to use a luxury map 
data structure in the interim and finally convert it to a list. on the 
synchronization side, since the returned CF is never going to be shared/modified 
by other threads, we know the access is always single-threaded, so no 
synchronization is needed.

but these 2 features are indeed needed by ColumnFamily in other cases, 
particularly writes. so we can provide a different ColumnFamily to 
CFS.getTopLevelColumnFamily(): getTopLevelColumnFamily no longer always 
creates the standard ColumnFamily, but takes a provided returnCF, which is 
much cheaper.

the provided patch is for demonstration now; I will work on it further once we 
agree on the general direction. 
CFS, ColumnFamily, and Table are changed; a new FastColumnFamily is provided. 
the main work is to let the FastColumnFamily use an array for internal storage. 
at first I used binary search to insert new columns in addColumn(), but later I 
found that even this is not necessary, since all calling scenarios of 
ColumnFamily.addColumn() have an invariant that the inserted columns come in 
sorted order (I still have an issue to resolve between descending and ascending 
order now, but ascending works). so the current logic is simply to compare the 
new column against the last column in the array: if the names are not equal, 
append; if they are equal, reconcile.
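
A minimal sketch of this append-or-reconcile idea, assuming columns arrive in 
ascending name order (FastColumn and the reconcile rule below are simplified 
stand-ins for illustration, not the types in the attached patch):
{code}
import java.util.ArrayList;
import java.util.List;

public class AppendOrReconcileSketch
{
    static final class FastColumn
    {
        final String name;
        final String value;
        final long timestamp;

        FastColumn(String name, String value, long timestamp)
        {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final List<FastColumn> columns = new ArrayList<FastColumn>();

    public void addColumn(FastColumn column)
    {
        if (columns.isEmpty())
        {
            columns.add(column);
            return;
        }
        FastColumn last = columns.get(columns.size() - 1);
        int cmp = last.name.compareTo(column.name);
        if (cmp < 0)
            columns.add(column);                                      // sorted input: just append
        else if (cmp == 0)
            columns.set(columns.size() - 1, reconcile(last, column)); // same name: reconcile
        else
            throw new IllegalStateException("columns must arrive in ascending order");
    }

    // simplified reconcile: keep the column with the higher timestamp
    private static FastColumn reconcile(FastColumn a, FastColumn b)
    {
        return a.timestamp >= b.timestamp ? a : b;
    }
}
{code}
Appending at the tail of an array this way is O(1) per column, versus the 
O(log n) insert plus synchronization overhead of the skip list.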

slight temporary hacks are made on getTopLevelColumnFamily so that we have 2 
flavors of the method, one accepting a returnCF. but we could definitely think 
about what a better way to provide this returnCF would be.


this patch compiles fine; no tests are provided yet. but I tested it in my 
application, and the performance improvement is dramatic: it gives about a 50% 
reduction in read time in the 3000-column case.


thanks
Yang


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2843) better performance on long row read

2011-07-01 Thread Yang Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yang updated CASSANDRA-2843:
-

Attachment: b.tar.gz

just untar this file into the 0.8.0-rc1 source tree, then compile

 better performance on long row read
 ---

 Key: CASSANDRA-2843
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2843
 Project: Cassandra
  Issue Type: New Feature
Reporter: Yang Yang
 Attachments: b.tar.gz, fast_cf.diff



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2843) better performance on long row read

2011-07-01 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058256#comment-13058256
 ] 

Sylvain Lebresne commented on CASSANDRA-2843:
-

The usual way to do things is to attach a patch. But I see that your diff doesn't 
include FastColumnFamily. The patch also includes instrumentation and a few 
unrelated changes (a commented-out method, a change from SortedSet to Set in an 
unrelated method signature) that would ideally be removed. It would be great to 
have this rebased to the current 0.8 branch too.

 better performance on long row read
 ---

 Key: CASSANDRA-2843
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2843
 Project: Cassandra
  Issue Type: New Feature
Reporter: Yang Yang
 Attachments: b.tar.gz, fast_cf.diff



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (CASSANDRA-2844) grep friendly nodetool compactionstats output

2011-07-01 Thread Wojciech Meler (JIRA)
grep friendly nodetool compactionstats output
-

 Key: CASSANDRA-2844
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2844
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.8.1
Reporter: Wojciech Meler
Priority: Trivial


output from nodetool compactionstats is quite hard to parse with text tools - 
it would be nice to have one line per compaction
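
A hedged illustration of what a grep-friendly, one-line-per-compaction format 
could look like; the field names and layout below are assumptions, not the 
actual nodetool columns:
{code}
public class CompactionLine
{
    // format one compaction as a single grep-friendly line; field names are
    // illustrative, not nodetool's actual output
    public static String format(String keyspace, String columnFamily, String taskType,
                                long bytesCompacted, long bytesTotal)
    {
        double pct = bytesTotal == 0 ? 100.0 : 100.0 * bytesCompacted / bytesTotal;
        return String.format("%s %s.%s %d/%d bytes (%.1f%%)",
                             taskType, keyspace, columnFamily, bytesCompacted, bytesTotal, pct);
    }

    public static void main(String[] args)
    {
        // sample values only
        System.out.println(format("Keyspace1", "Standard1", "Compaction", 1048576L, 4194304L));
    }
}
{code}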

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2844) grep friendly nodetool compactionstats output

2011-07-01 Thread Wojciech Meler (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wojciech Meler updated CASSANDRA-2844:
--

Attachment: comapctionstats.patch

patch for 0.8.1 that does the job

 grep friendly nodetool compactionstats output
 -

 Key: CASSANDRA-2844
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2844
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.8.1
Reporter: Wojciech Meler
Priority: Trivial
 Attachments: comapctionstats.patch



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2819) Split rpc timeout for read and write ops

2011-07-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058539#comment-13058539
 ] 

Jonathan Ellis commented on CASSANDRA-2819:
---

That is the wrong place for it because Message is used both to send and to 
receive.  MDT creation time is effectively identical.

 Split rpc timeout for read and write ops
 

 Key: CASSANDRA-2819
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2819
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Stu Hood
Assignee: Melvin Wang
 Fix For: 1.0

 Attachments: twttr-cassandra-0.8-counts-resync-rpc-rw-timeouts.diff


 Given the vastly different latency characteristics of reads and writes, it 
 makes sense for them to have independent rpc timeouts internally.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (CASSANDRA-2845) Cassandra uses 100% system CPU on Ubuntu Natty (11.04)

2011-07-01 Thread Steve Corona (JIRA)
Cassandra uses 100% system CPU on Ubuntu Natty (11.04)
--

 Key: CASSANDRA-2845
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2845
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.1, 0.8.0
 Environment: Default install of Ubuntu 11.04
Reporter: Steve Corona


Step 1. Boot up a brand new, default Ubuntu 11.04 Server install
Step 2. Install Cassandra from the Apache APT Repository (deb 
http://www.apache.org/dist/cassandra/debian 08x main)
Step 3. apt-get install cassandra; as soon as cassandra starts it will 
freeze the machine

What's happening is that as soon as cassandra starts up it immediately sucks up 
100% of CPU and starves the machine. This effectively bricks the box until you 
boot into single user mode and disable the cassandra init.d script.

Under htop, the CPU usage shows up as system cpu, not user.

The machine I'm testing this on is a Quad-Core Sandy Bridge w/ 16GB of Memory, 
so it's not a system resource issue. I've also tested this on completely 
different hardware (Dual 64-Bit Xeons & AMD X4) and it has the same effect.

Ubuntu 10.10 does not exhibit the same issue. I have only tested 0.8 and 0.8.1.

root@cassandra01:/# java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.2) (6b22-1.10.2-0ubuntu1~11.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

root@cassandra:/# uname -a
Linux cassandra01 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011 
x86_64 x86_64 x86_64 GNU/Linux

/proc/cpu
Intel(R) Xeon(R) CPU E31270 @ 3.40GHz

/proc/meminfo
MemTotal:   16459776 kB
MemFree:14190708 kB

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (CASSANDRA-2846) Changing replication_factor using update keyspace not working

2011-07-01 Thread JIRA
Changing replication_factor using update keyspace not working
---

 Key: CASSANDRA-2846
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2846
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.1
 Environment: A clean 0.8.1 install using the default configuration
Reporter: Jonas Borgström


Unless I've misunderstood the new way to do this in 0.8, I think update 
keyspace is broken:

{code}
[default@unknown] create keyspace Test with placement_strategy = 
'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
[{replication_factor:1}];
37f70d40-a3e9-11e0--242d50cf1fbf
Waiting for schema agreement...
... schemas agree across the cluster
[default@unknown] describe keyspace Test;
Keyspace: Test:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
Options: [replication_factor:1]
  Column Families:
[default@unknown] update keyspace Test with placement_strategy = 
'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
[{replication_factor:2}];
489fe220-a3e9-11e0--242d50cf1fbf
Waiting for schema agreement...
... schemas agree across the cluster
[default@unknown] describe keyspace Test;   

Keyspace: Test:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
Options: [replication_factor:1]
  Column Families:
{code}

Isn't the second describe keyspace supposed to say replication_factor:2?

Relevant bits from system.log:
{code}
Migration.java (line 116) Applying migration 489fe220-a3e9-11e0--242d50cf1fbf Update keyspace Test<rep strategy:SimpleStrategy{}durable_writes: true> to Test<rep strategy:SimpleStrategy{}durable_writes: true>
UpdateKeyspace.java (line 74) Keyspace updated. Please perform any manual operations
{code}


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2842) Hive JDBC connections fail with InvalidUrlException when both the C* and Hive JDBC drivers are loaded

2011-07-01 Thread Rick Shaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Shaw updated CASSANDRA-2842:
-

Attachment: pass-if-not-right-driver-v1.txt

This test has been run against v1.0.3 of the driver. In that version the 
{{connect(...)}} method of {{CassandraDriver}} is called with an unsupported 
protocol:subprotocol in its URL. It recognizes that it is not the proper 
protocol, but erroneously throws an exception rather than returning null to the 
caller to indicate that it cannot handle the URL and the caller should move on. 
The patch is based on the current trunk of {{/drivers}} (v1.0.4).
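
For reference, a sketch of the standard JDBC contract the fix relies on 
(illustrative only, not the attached patch): a driver handed a URL it does not 
own should return null from connect() so DriverManager can try the next 
registered driver. PoliteDriver and doConnect below are hypothetical names.
{code}
import java.sql.Connection;
import java.sql.SQLException;
import java.util.Properties;

// connect() must return null (not throw) for a URL this driver does not
// accept, so DriverManager can move on to the next registered driver.
public abstract class PoliteDriver implements java.sql.Driver
{
    private static final String PREFIX = "jdbc:cassandra";   // assumed prefix

    public boolean acceptsURL(String url)
    {
        return url != null && url.startsWith(PREFIX);
    }

    public Connection connect(String url, Properties props) throws SQLException
    {
        if (!acceptsURL(url))
            return null;               // not ours: defer to the next driver
        return doConnect(url, props);  // hypothetical internal helper
    }

    protected abstract Connection doConnect(String url, Properties props) throws SQLException;
}
{code}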

 Hive JDBC connections fail with InvalidUrlException when both the C* and Hive 
 JDBC drivers are loaded
 -

 Key: CASSANDRA-2842
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2842
 Project: Cassandra
  Issue Type: Bug
Reporter: Cathy Daw
Priority: Trivial
 Attachments: pass-if-not-right-driver-v1.txt


 Hive connections fail with InvalidUrlException when both the C* and Hive JDBC 
 drivers are loaded, and it seems the URL is being interpreted as a C* url.
 {code}
   Caused an ERROR
 [junit] Invalid connection url:jdbc:hive://127.0.0.1:1/default. 
 should start with jdbc:cassandra
 [junit] org.apache.cassandra.cql.jdbc.InvalidUrlException: Invalid 
 connection url:jdbc:hive://127.0.0.1:1/default. should start with 
 jdbc:cassandra
 [junit]   at 
 org.apache.cassandra.cql.jdbc.CassandraDriver.connect(CassandraDriver.java:90)
 [junit]   at java.sql.DriverManager.getConnection(DriverManager.java:582)
 [junit]   at java.sql.DriverManager.getConnection(DriverManager.java:185)
 [junit]   at 
 com.datastax.bugRepros.repro_connection_error.test1_runHiveBeforeJdbc(repro_connection_error.java:34)
 {code}
 *Code Snippet: intended to illustrate the connection issues* 
 * Copy file to test directory
 * Change package declaration
 * run:  ant test -Dtest.name=repro_conn_error
 {code}
 package com.datastax.bugRepros;

 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.SQLException;
 import java.util.Enumeration;

 import org.junit.Test;

 public class repro_conn_error
 {
     @Test
     public void jdbcConnectionError() throws Exception
     {
         // Create Hive JDBC Connection - will succeed if
         try
         {
             // Uncomment loading C* driver to reproduce bug
             Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");

             // Load Hive driver and connect
             Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
             Connection hiveConn =
                 DriverManager.getConnection("jdbc:hive://127.0.0.1:1/default", "", "");
             hiveConn.close();
             System.out.println("successful hive connection");
         } catch (SQLException e) {
             System.out.println("unsuccessful hive connection");
             e.printStackTrace();
         }

         // Create C* JDBC Connection
         try
         {
             Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");
             Connection jdbcConn =
                 DriverManager.getConnection("jdbc:cassandra:root/root@127.0.0.1:9160/default");
             jdbcConn.close();
             System.out.println("successful c* connection");
         } catch (SQLException e) {
             System.out.println("unsuccessful c* connection");
             e.printStackTrace();
         }

         // Print out all loaded JDBC drivers.
         Enumeration d = java.sql.DriverManager.getDrivers();
         while (d.hasMoreElements()) {
             Object driverAsObject = d.nextElement();
             System.out.println("JDBC driver=" + driverAsObject);
         }
     }
 }
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2739) Cannot recover SSTable with version f (current version g) during the node decommission.

2011-07-01 Thread Thibaut (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058568#comment-13058568
 ] 

Thibaut commented on CASSANDRA-2739:


Running into the same problem after upgrading our test cluster from 0.7.* 
(don't know what the exact version number was) to 0.8.1. Do I have to run scrub 
on each node, and will everything be fine afterwards?

We plan to upgrade our production cluster soon and can't afford to lose data 
there.


 Cannot recover SSTable with version f (current version g) during the node 
 decommission.
 ---

 Key: CASSANDRA-2739
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2739
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0
 Environment: centos, cassandra 0.7.4 upgrade to 0.8.0-final.
Reporter: Dikang Gu
  Labels: decommission, version

 I upgrade the 4-nodes cassandra 0.7.4 cluster to 0.8.0-final. Then, I do the 
 bin/nodetool decommission on one node, the decommission hangs there and I got 
 the following errors on other nodes.
 ERROR [Thread-55] 2011-06-03 18:02:03,500 AbstractCassandraDaemon.java (line 
 113) Fatal exception in thread Thread[Thread-55,5,main]
 java.lang.RuntimeException: Cannot recover SSTable with version f (current 
 version g).
   at 
 org.apache.cassandra.io.sstable.SSTableWriter.createBuilder(SSTableWriter.java:240)
   at 
 org.apache.cassandra.db.CompactionManager.submitSSTableBuild(CompactionManager.java:1088)
   at 
 org.apache.cassandra.streaming.StreamInSession.finished(StreamInSession.java:108)
   at 
 org.apache.cassandra.streaming.IncomingStreamReader.readFile(IncomingStreamReader.java:104)
   at 
 org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:61)
   at 
 org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:155)
   at 
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:93)
 ERROR [Thread-56] 2011-06-03 18:02:04,285 AbstractCassandraDaemon.java (line 
 113) Fatal exception in thread Thread[Thread-56,5,main]
 java.lang.RuntimeException: Cannot recover SSTable with version f (current 
 version g).
   at 
 org.apache.cassandra.io.sstable.SSTableWriter.createBuilder(SSTableWriter.java:240)
   at 
 org.apache.cassandra.db.CompactionManager.submitSSTableBuild(CompactionManager.java:1088)
   at 
 org.apache.cassandra.streaming.StreamInSession.finished(StreamInSession.java:108)
   at 
 org.apache.cassandra.streaming.IncomingStreamReader.readFile(IncomingStreamReader.java:104)
   at 
 org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:61)
   at 
 org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:155)
   at 
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:93)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2846) Changing replication_factor using update keyspace not working

2011-07-01 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-2846:
--

Attachment: 2846.txt

// server helpfully sets deprecated replication factor when it sends a 
KsDef back, for older clients.
// we need to unset that on the new KsDef we create to avoid being 
treated as a legacy client in return.
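
A hypothetical client-side sketch of that approach, assuming the 0.8 Thrift API 
(describe_keyspace / system_update_keyspace) and the Thrift-generated 
unsetReplication_factor() on KsDef; UpdateRf and updateReplicationFactor are 
illustrative names, not the attached patch:
{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.KsDef;

public class UpdateRf
{
    // Raise the replication factor of a keyspace through the Thrift API,
    // clearing the deprecated replication_factor field that describe_keyspace
    // returns so the server does not treat us as a legacy client.
    public static void updateReplicationFactor(Cassandra.Client client, String keyspace, int rf)
            throws Exception
    {
        KsDef ksDef = client.describe_keyspace(keyspace);
        Map<String, String> opts = new HashMap<String, String>();
        opts.put("replication_factor", Integer.toString(rf));
        ksDef.setStrategy_options(opts);
        ksDef.unsetReplication_factor();   // assumed Thrift-generated unsetter
        client.system_update_keyspace(ksDef);
    }
}
{code}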


 Changing replication_factor using update keyspace not working
 ---

 Key: CASSANDRA-2846
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2846
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.1
 Environment: A clean 0.8.1 install using the default configuration
Reporter: Jonas Borgström
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2846.txt



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2845) Cassandra uses 100% system CPU on Ubuntu Natty (11.04)

2011-07-01 Thread Steve Corona (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058613#comment-13058613
 ] 

Steve Corona commented on CASSANDRA-2845:
-

I actually figured this out- it's more of a cassandra packaging issue than an 
issue with the actual code.

I extracted the cassandra-0.8.1.deb file and diff'ed all of the files with 
apache-cassandra-0.8.1-bin.tar.gz. I noticed that apache-cassandra-0.8.1.jar 
was off by a few bytes. I extracted the jar and determined that the deb file 
was using a different version of the following classes:

cli/CliLexer.class
cli/CliParser.class
cql/CqlLexer.class
cql/CqlParser.class

I repackaged the .deb using apache-cassandra-0.8.1.jar from the bin.tar.gz 
(will post instructions below) and it installed on Ubuntu 11.04 without a 
hitch. I'm not sure if the .jar/.class files used to package the deb were 
corrupted or are just a different/incomplete/broken version.

Poor man's .deb repackaging until it's officially fixed:

cd /tmp
mkdir work && cd work
wget http://www.fightrice.com/mirrors/apache/cassandra/0.8.1/apache-cassandra-0.8.1-bin.tar.gz
tar -zxvf apache-cassandra-0.8.1-bin.tar.gz

mkdir deb && cd deb
wget http://www.apache.org/dist/cassandra/debian/pool/main/c/cassandra/cassandra_0.8.1_all.deb

# need binutils to get the ar utility
sudo apt-get install binutils

ar vx cassandra_0.8.1_all.deb
tar -zxvf data.tar.gz
rm data.tar.gz
cd ./usr/share/cassandra

mv /tmp/work/apache-cassandra-0.8.1/lib/apache-cassandra-0.8.1.jar .
cd /tmp/work/deb
tar -czvf data.tar.gz etc/ usr/ var/

rm cassandra_0.8.1_all.deb
ar rc cassandra_0.8.1_all.deb debian-binary control.tar.gz data.tar.gz

sudo apt-get install openjdk-6-jdk
sudo dpkg -i cassandra_0.8.1_all.deb

Alternatively, you can use policy-rc.d to prevent cassandra.deb's post-init 
script from running on install and replace the messed up .jar after it has been 
installed. Instructions here: 
http://lifeonubuntu.com/how-to-prevent-server-daemons-from-starting-during-apt-get-install/




 Cassandra uses 100% system CPU on Ubuntu Natty (11.04)
 --

 Key: CASSANDRA-2845
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2845
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0, 0.8.1
 Environment: Default install of Ubuntu 11.04
Reporter: Steve Corona


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (CASSANDRA-2845) Cassandra uses 100% system CPU on Ubuntu Natty (11.04)

2011-07-01 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis reassigned CASSANDRA-2845:
-

Assignee: paul cannon

/baffled

 Cassandra uses 100% system CPU on Ubuntu Natty (11.04)
 --

 Key: CASSANDRA-2845
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2845
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0, 0.8.1
 Environment: Default install of Ubuntu 11.04
Reporter: Steve Corona
Assignee: paul cannon


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2843) better performance on long row read

2011-07-01 Thread Yang Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yang updated CASSANDRA-2843:
-

Attachment: (was: fast_cf.diff)

 better performance on long row read
 ---

 Key: CASSANDRA-2843
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2843
 Project: Cassandra
  Issue Type: New Feature
Reporter: Yang Yang


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2843) better performance on long row read

2011-07-01 Thread Yang Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yang updated CASSANDRA-2843:
-

Attachment: (was: b.tar.gz)

 better performance on long row read
 ---

 Key: CASSANDRA-2843
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2843
 Project: Cassandra
  Issue Type: New Feature
Reporter: Yang Yang


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2843) better performance on long row read

2011-07-01 Thread Yang Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yang updated CASSANDRA-2843:
-

Attachment: fast_cf_081_trunk.diff



 better performance on long row read
 ---

 Key: CASSANDRA-2843
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2843
 Project: Cassandra
  Issue Type: New Feature
Reporter: Yang Yang
 Attachments: fast_cf_081_trunk.diff



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2843) better performance on long row read

2011-07-01 Thread Yang Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058645#comment-13058645
 ] 

Yang Yang commented on CASSANDRA-2843:
--

thanks Sylvain. 

I changed the patch to be based on the current svn trunk.

(sorry, the last attempt was based on the 0.8.0-rc1 tarball; I did not know how 
to include a new file in diff -uw -r, so I had to include FastColumnFamily.java 
in a tarball)

sorry, the last SortedSet change was a typo... I once changed SortedSet to Set 
when I tried to use the cheaper HashMap, but later removed it when I switched to 
the array.

a lot of the FastColumnFamily methods are not implemented now, but the basic 
functionality is there to demonstrate the idea

 better performance on long row read
 ---

 Key: CASSANDRA-2843
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2843
 Project: Cassandra
  Issue Type: New Feature
Reporter: Yang Yang
 Attachments: fast_cf_081_trunk.diff



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2843) better performance on long row read

2011-07-01 Thread Yang Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058652#comment-13058652
 ] 

Yang Yang commented on CASSANDRA-2843:
--

right now, design-wise, the thing I'm most unsure about is where to properly 
inject the returnCF.

also, on a bigger scale, the multiple levels of
collateIterator
reducingIterator
ColumnFamily
Table.getRow()

could probably be looked at from a more holistic view, so that fewer internal 
conversions are done. my patch makes a small attempt at this, but probably more 
can be done: for example, getRow() converts the result of 
CFS.getSortedColumns() into another List by thriftifyColumns(). instead of a 
list, we may just let FastColumnFamily pass along the original iterators, and 
thriftify directly from the iterator instead of going through 
FastColumnFamily.columns_array. this time saving could be small, though, since 
the array is already very cheap.
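
A generic sketch of that single-pass idea (Converter here is a hypothetical 
stand-in for the existing thriftifyColumns() conversion, not a real Cassandra 
interface):
{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public final class Thriftify
{
    // Converter stands in for the existing per-column conversion done by
    // thriftifyColumns(); it is a placeholder, not a real Cassandra interface.
    public interface Converter<IN, OUT>
    {
        OUT convert(IN column);
    }

    // Walk the column iterator once and build the output list directly,
    // instead of materializing an intermediate collection first.
    public static <IN, OUT> List<OUT> thriftify(Iterator<IN> columns, Converter<IN, OUT> converter)
    {
        List<OUT> result = new ArrayList<OUT>();
        while (columns.hasNext())
            result.add(converter.convert(columns.next()));
        return result;
    }
}
{code}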



 better performance on long row read
 ---

 Key: CASSANDRA-2843
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2843
 Project: Cassandra
  Issue Type: New Feature
Reporter: Yang Yang
 Attachments: fast_cf_081_trunk.diff



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2842) Hive JDBC connections fail with InvalidUrlException when both the C* and Hive JDBC drivers are loaded

2011-07-01 Thread Rick Shaw (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058658#comment-13058658
 ] 

Rick Shaw commented on CASSANDRA-2842:
--

I took a quick look at the Hive sources and I believe you will find that the
Hive driver suffers from this defect as well. So if you reversed the order, I
think it would be the Hive driver that throws an exception rather than
deferring to the next driver in the chain of loaded drivers (C*).
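
For reference, the cooperative JDBC pattern is for connect() to return null, rather than throw, for a URL the driver does not accept, so DriverManager can keep walking the registered drivers. A minimal illustration follows (class and prefix names are hypothetical; the actual change to the C* driver appears in the svn commit further down this digest):

{code}
// Illustrative only: the cooperative pattern each driver should follow.
import java.sql.Connection;
import java.sql.SQLException;

public abstract class CooperativeDriver implements java.sql.Driver
{
    private static final String PREFIX = "jdbc:example:"; // each driver's own prefix

    public boolean acceptsURL(String url) throws SQLException
    {
        return url != null && url.startsWith(PREFIX);
    }

    public Connection connect(String url, java.util.Properties info) throws SQLException
    {
        if (!acceptsURL(url))
            return null; // defer to the next registered driver instead of throwing
        return doConnect(url, info);
    }

    protected abstract Connection doConnect(String url, java.util.Properties info) throws SQLException;
}
{code}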

 Hive JDBC connections fail with InvalidUrlException when both the C* and Hive 
 JDBC drivers are loaded
 -

 Key: CASSANDRA-2842
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2842
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 1.0
Reporter: Cathy Daw
Assignee: Rick Shaw
Priority: Trivial
 Fix For: 1.0

 Attachments: pass-if-not-right-driver-v1.txt


 Hive connections fail with InvalidUrlException when both the C* and Hive JDBC 
 drivers are loaded, and it seems the URL is being interpreted as a C* url.
 {code}
   Caused an ERROR
 [junit] Invalid connection url:jdbc:hive://127.0.0.1:1/default. 
 should start with jdbc:cassandra
 [junit] org.apache.cassandra.cql.jdbc.InvalidUrlException: Invalid 
 connection url:jdbc:hive://127.0.0.1:1/default. should start with 
 jdbc:cassandra
 [junit]   at 
 org.apache.cassandra.cql.jdbc.CassandraDriver.connect(CassandraDriver.java:90)
 [junit]   at java.sql.DriverManager.getConnection(DriverManager.java:582)
 [junit]   at java.sql.DriverManager.getConnection(DriverManager.java:185)
 [junit]   at 
 com.datastax.bugRepros.repro_connection_error.test1_runHiveBeforeJdbc(repro_connection_error.java:34)
 {code}
 *Code Snippet: intended to illustrate the connection issues* 
 * Copy file to test directory
 * Change package declaration
 * run:  ant test -Dtest.name=repro_conn_error
 {code}
 package com.datastax.bugRepros;
 import java.sql.DriverManager;
 import java.sql.Connection;
 import java.sql.SQLException;
 import java.util.Enumeration;
 import org.junit.Test;
 public class repro_conn_error
 {
 @Test
 public void jdbcConnectionError() throws Exception
 {
 // Create Hive JDBC Connection - will succeed if
 try
 {
 // Uncomment loading C* driver to reproduce bug
 Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");

 // Load Hive driver and connect
 Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
 Connection hiveConn =
 DriverManager.getConnection("jdbc:hive://127.0.0.1:1/default", "", "");
 hiveConn.close();
 System.out.println("successful hive connection");
 } catch (SQLException e) {
 System.out.println("unsuccessful hive connection");
 e.printStackTrace();
 }

 // Create C* JDBC Connection
 try
 {
 Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");
 Connection jdbcConn =
 DriverManager.getConnection("jdbc:cassandra:root/root@127.0.0.1:9160/default");
 jdbcConn.close();
 System.out.println("successful c* connection");
 } catch (SQLException e) {
 System.out.println("unsuccessful c* connection");
 e.printStackTrace();
 }

 // Print out all loaded JDBC drivers.
 Enumeration d = java.sql.DriverManager.getDrivers();

 while (d.hasMoreElements()) {
 Object driverAsObject = d.nextElement();
 System.out.println("JDBC driver=" + driverAsObject);
 }
 }
 }
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2819) Split rpc timeout for read and write ops

2011-07-01 Thread Melvin Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058684#comment-13058684
 ] 

Melvin Wang commented on CASSANDRA-2819:


How about adding the creation timestamp to the header of the message? MDT is
executed almost immediately after it is created, so the construction time of
MDT is too close to the checkpoint we have in run(). I am just concerned about
the effectiveness of the current logic in MDT, although I am not sure about the
consequences of adding 4 bytes to every message we create.
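
A hypothetical sketch of the idea, not the actual MessagingService code: stamp the message at construction time and drop it if it has already exceeded the rpc timeout by the time the delivery task runs. The class and method names are illustrative.

{code}
// Hypothetical sketch only. The timestamp would ride in the message header
// (the comment above suggests 4 bytes; a full millisecond clock is 8).
public final class TimestampedMessage
{
    private final long constructedAt = System.currentTimeMillis();
    private final Runnable payload;

    public TimestampedMessage(Runnable payload)
    {
        this.payload = payload;
    }

    /** Returns true if the message was processed, false if it was dropped as stale. */
    public boolean deliver(long rpcTimeoutMillis)
    {
        if (System.currentTimeMillis() - constructedAt > rpcTimeoutMillis)
            return false; // already older than the timeout; drop rather than do wasted work
        payload.run();
        return true;
    }
}
{code}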

 Split rpc timeout for read and write ops
 

 Key: CASSANDRA-2819
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2819
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Stu Hood
Assignee: Melvin Wang
 Fix For: 1.0

 Attachments: twttr-cassandra-0.8-counts-resync-rpc-rw-timeouts.diff


 Given the vastly different latency characteristics of reads and writes, it 
 makes sense for them to have independent rpc timeouts internally.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (CASSANDRA-2847) Nullpointer Exception in get_range_slices

2011-07-01 Thread Thibaut (JIRA)
Nullpointer Exception in get_range_slices
-

 Key: CASSANDRA-2847
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2847
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.1
Reporter: Thibaut
Priority: Critical


Hi,

we upgraded our test cluster from 0.7.* to 0.8.1. We ran nodetool scrub on
each node, and then nodetool repair (repair might not have finished yet). We
also upgraded to hector 0.8.1.

We tried to run our application and get_range_slices fails with the following 
error:

ERROR [pool-2-thread-15] 2011-07-01 20:15:46,224 Cassandra.java (line 3210) 
Internal error processing get_range_slices
java.lang.NullPointerException
at org.apache.cassandra.db.ColumnFamily.diff(ColumnFamily.java:298)
at org.apache.cassandra.db.ColumnFamily.diff(ColumnFamily.java:406)
at 
org.apache.cassandra.service.RowRepairResolver.maybeScheduleRepairs(RowRepairResolver.java:103)
at 
org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:120)
at 
org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:85)
at 
org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:74)
at 
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at 
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at 
org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:715)
at 
org.apache.cassandra.thrift.CassandraServer.get_range_slices(CassandraServer.java:617)
at 
org.apache.cassandra.thrift.Cassandra$Processor$get_range_slices.process(Cassandra.java:3202)
at 
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)





--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2845) Cassandra uses 100% system CPU on Ubuntu Natty (11.04)

2011-07-01 Thread Steve Corona (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058706#comment-13058706
 ] 

Steve Corona commented on CASSANDRA-2845:
-

Okay, so as it turns out, the original problem is different from what I
thought. My dpkg solution was just skirting around the real issue (since dpkg
doesn't force you to install all of the recommended dependencies).

It's libjna-java (3.2.4-2ubuntu2) that's really causing the issue. The
cassandra apt repository is pulling it in as a dependency and, for whatever
reason, it sucks up all of the CPU when it runs with cassandra. I don't know if
it's a matter of libjna being broken in 11.04 or just that it doesn't play nice
with Cassandra.

FWIW, CASSANDRA-2803 mentions deb packages and libjna; not sure what role that
plays in this.

Here is my current workaround:

mkdir -p /usr/sbin/
cat > /usr/sbin/policy-rc.d <<EOF
#!/bin/sh
exit 101
EOF
chmod 755 /usr/sbin/policy-rc.d

apt-get install cassandra
apt-get remove libjna-java
service cassandra start


 Cassandra uses 100% system CPU on Ubuntu Natty (11.04)
 --

 Key: CASSANDRA-2845
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2845
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0, 0.8.1
 Environment: Default install of Ubuntu 11.04
Reporter: Steve Corona
Assignee: paul cannon

 Step 1. Boot up a brand new, default Ubuntu 11.04 Server install
 Step 2. Install Cassandra from Apache APT Respository (deb 
 http://www.apache.org/dist/cassandra/debian 08x main)
 Step 3. apt-get install cassandra, as soon as it cassandra starts it will 
 freeze the machine
 What's happening is that as soon as cassandra starts up it immediately sucks 
 up 100% of CPU and starves the machine. This effectively bricks the box until 
 you boot into single user mode and disable the cassandra init.d script.
 Under htop, the CPU usage shows up as system cpu, not user.
 The machine I'm testing this on is a Quad-Core Sandy Bridge w/ 16GB of
 Memory, so it's not a system resource issue. I've also tested this on
 completely different hardware (Dual 64-Bit Xeons and an AMD X4) and it has the
 same effect.
 Ubuntu 10.10 does not exhibit the same issue. I have only tested 0.8 and 
 0.8.1.
 root@cassandra01:/# java -version
 java version "1.6.0_22"
 OpenJDK Runtime Environment (IcedTea6 1.10.2) (6b22-1.10.2-0ubuntu1~11.04.1)
 OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)
 root@cassandra:/# uname -a
 Linux cassandra01 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 
 2011 x86_64 x86_64 x86_64 GNU/Linux
 /proc/cpu
 Intel(R) Xeon(R) CPU E31270 @ 3.40GHz
 /proc/meminfo
 MemTotal:   16459776 kB
 MemFree:14190708 kB

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2819) Split rpc timeout for read and write ops

2011-07-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058715#comment-13058715
 ] 

Jonathan Ellis commented on CASSANDRA-2819:
---

Let's keep the scope of this ticket to splitting the rpc timeout. We can open
another for making request dropping more accurate/aggressive.

 Split rpc timeout for read and write ops
 

 Key: CASSANDRA-2819
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2819
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Stu Hood
Assignee: Melvin Wang
 Fix For: 1.0

 Attachments: twttr-cassandra-0.8-counts-resync-rpc-rw-timeouts.diff


 Given the vastly different latency characteristics of reads and writes, it 
 makes sense for them to have independent rpc timeouts internally.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2252) off-heap memtables

2011-07-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058733#comment-13058733
 ] 

Jonathan Ellis commented on CASSANDRA-2252:
---

JNA 3.3.0 has been released including the http://java.net/jira/browse/JNA-179 
fixes.

 off-heap memtables
 --

 Key: CASSANDRA-2252
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2252
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
 Fix For: 1.0

 Attachments: 0001-add-MemtableAllocator.txt, 
 0002-add-off-heap-MemtableAllocator-support.txt, merged-2252.tgz

   Original Estimate: 0.4h
  Remaining Estimate: 0.4h

 The memtable design practically actively fights Java's GC design.  Todd 
 Lipcon gave a good explanation over on HBASE-3455.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2847) Nullpointer Exception in get_range_slices

2011-07-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058736#comment-13058736
 ] 

Jonathan Ellis commented on CASSANDRA-2847:
---

Sounds like CASSANDRA-2823.  Can you try svn head of the 0.8 branch?

 Nullpointer Exception in get_range_slices
 -

 Key: CASSANDRA-2847
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2847
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.1
Reporter: Thibaut
Priority: Critical

 Hi,
 we upgraded our test cluster from 0.7.* to 0.8.1. We did run nodetool scrub 
 on each node, and then nodetool repair (Repair might not have finished so 
 far). We also upgradet to hector 0.8.1 
 We tried to run our application and get_range_slices fails with the following 
 error:
 ERROR [pool-2-thread-15] 2011-07-01 20:15:46,224 Cassandra.java (line 3210) 
 Internal error processing get_range_slices
 java.lang.NullPointerException
 at org.apache.cassandra.db.ColumnFamily.diff(ColumnFamily.java:298)
 at org.apache.cassandra.db.ColumnFamily.diff(ColumnFamily.java:406)
 at 
 org.apache.cassandra.service.RowRepairResolver.maybeScheduleRepairs(RowRepairResolver.java:103)
 at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:120)
 at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:85)
 at 
 org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:74)
 at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
 at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
 at 
 org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:715)
 at 
 org.apache.cassandra.thrift.CassandraServer.get_range_slices(CassandraServer.java:617)
 at 
 org.apache.cassandra.thrift.Cassandra$Processor$get_range_slices.process(Cassandra.java:3202)
 at 
 org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
 at 
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




svn commit: r1142046 - in /cassandra/drivers/java: CHANGES.txt src/org/apache/cassandra/cql/jdbc/CassandraDriver.java

2011-07-01 Thread jbellis
Author: jbellis
Date: Fri Jul  1 19:49:33 2011
New Revision: 1142046

URL: http://svn.apache.org/viewvc?rev=1142046view=rev
Log:
cooperate with other jdbc drivers
patch by Rick Shaw; reviewed by jbellis for CASSANDRA-2842

Modified:
cassandra/drivers/java/CHANGES.txt

cassandra/drivers/java/src/org/apache/cassandra/cql/jdbc/CassandraDriver.java

Modified: cassandra/drivers/java/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/cassandra/drivers/java/CHANGES.txt?rev=1142046r1=1142045r2=1142046view=diff
==
--- cassandra/drivers/java/CHANGES.txt (original)
+++ cassandra/drivers/java/CHANGES.txt Fri Jul  1 19:49:33 2011
@@ -1,2 +1,3 @@
 1.0.4
   * improve JDBC spec compliance (CASSANDRA-2720, 2754)
+  * cooperate with other jdbc drivers (CASSANDRA-2842)

Modified: 
cassandra/drivers/java/src/org/apache/cassandra/cql/jdbc/CassandraDriver.java
URL: 
http://svn.apache.org/viewvc/cassandra/drivers/java/src/org/apache/cassandra/cql/jdbc/CassandraDriver.java?rev=1142046r1=1142045r2=1142046view=diff
==
--- 
cassandra/drivers/java/src/org/apache/cassandra/cql/jdbc/CassandraDriver.java 
(original)
+++ 
cassandra/drivers/java/src/org/apache/cassandra/cql/jdbc/CassandraDriver.java 
Fri Jul  1 19:49:33 2011
@@ -20,6 +20,8 @@
  */
 package org.apache.cassandra.cql.jdbc;
 
+import static org.apache.cassandra.cql.jdbc.Utils.*;
+
 import java.sql.Connection;
 import java.sql.Driver;
 import java.sql.DriverManager;
@@ -39,12 +41,7 @@ import java.util.Properties;
 
 /** The Constant MINOR_VERSION. */
 private static final int MINOR_VERSION = 0;
-
-private static final String BAD_URL = "Invalid connection url: '%s'. it should start with 'jdbc:cassandra:'";
 
-/** The ACCEPT s_ url. */
-public static final String ACCEPTS_URL = "jdbc:cassandra:";
-
 //private static final Logger logger = 
LoggerFactory.getLogger(CassandraDriver.class); 
 
 static
@@ -66,7 +63,7 @@ import java.util.Properties;
  */
 public boolean acceptsURL(String url) throws SQLException
 {
-return url.startsWith(ACCEPTS_URL);
+return url.startsWith(PROTOCOL);
 }
 
 /**
@@ -80,7 +77,7 @@ import java.util.Properties;
 }
 else
 {
-throw new 
SQLNonTransientConnectionException(String.format(BAD_URL, url));
+return null; // signal it is the wrong driver for this 
protocol:subprotocol
 }
 }
 




svn commit: r1142050 - in /cassandra/branches/cassandra-0.8: CHANGES.txt src/java/org/apache/cassandra/tools/NodeCmd.java

2011-07-01 Thread jbellis
Author: jbellis
Date: Fri Jul  1 19:54:15 2011
New Revision: 1142050

URL: http://svn.apache.org/viewvc?rev=1142050view=rev
Log:
improve nodetool compactionstats formatting
patch by Wojciech Meler; reviewed by jbellis for CASSANDRA-2844

Modified:
cassandra/branches/cassandra-0.8/CHANGES.txt

cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/tools/NodeCmd.java

Modified: cassandra/branches/cassandra-0.8/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/CHANGES.txt?rev=1142050r1=1142049r2=1142050view=diff
==
--- cassandra/branches/cassandra-0.8/CHANGES.txt (original)
+++ cassandra/branches/cassandra-0.8/CHANGES.txt Fri Jul  1 19:54:15 2011
@@ -11,6 +11,7 @@
(CASSANDRA-2823)
  * Fix race in SystemTable.getCurrentLocalNodeId (CASSANDRA-2824)
  * Correctly set default for replicate_on_write (CASSANDRA-2835)
+ * improve nodetool compactionstats formatting (CASSANDRA-2844)
 
 
 0.8.1

Modified: 
cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/tools/NodeCmd.java
URL: 
http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/tools/NodeCmd.java?rev=1142050r1=1142049r2=1142050view=diff
==
--- 
cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/tools/NodeCmd.java
 (original)
+++ 
cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/tools/NodeCmd.java
 Fri Jul  1 19:54:15 2011
@@ -354,26 +354,22 @@ public class NodeCmd
         completed += n;
         outs.printf("%-25s%10s%10s%15s%n", "Responses", "n/a", pending, completed);
     }
-   
+
     public void printCompactionStats(PrintStream outs)
     {
         CompactionManagerMBean cm = probe.getCompactionManagerProxy();
+        outs.println("pending tasks: " + cm.getPendingTasks());
+        if (cm.getCompactions().size() > 0)
+            outs.printf("%25s%16s%16s%16s%16s%10s%n", "compaction type", "keyspace", "column family", "bytes compacted", "bytes total", "progress");
         for (CompactionInfo c : cm.getCompactions())
         {
-            outs.println("compaction type: " + c.getTaskType());
-            outs.println("keyspace: " + c.getKeyspace());
-            outs.println("column family: " + c.getColumnFamily());
-            outs.println("bytes compacted: " + c.getBytesComplete());
-            outs.println("bytes total: " + c.getTotalBytes());
             String percentComplete = c.getTotalBytes() == 0
                                    ? "n/a"
-                                   : new DecimalFormat("#.##").format((double) c.getBytesComplete() / c.getTotalBytes() * 100) + "%";
-            outs.println("compaction progress: " + percentComplete);
-            outs.println("-");
+                                   : new DecimalFormat("0.00").format((double) c.getBytesComplete() / c.getTotalBytes() * 100) + "%";
+            outs.printf("%25s%16s%16s%16s%16s%10s%n", c.getTaskType(), c.getKeyspace(), c.getColumnFamily(), c.getBytesComplete(), c.getTotalBytes(), percentComplete);
         }
-        outs.println("pending tasks: " + cm.getPendingTasks());
     }
- 
+
     public void printColumnFamilyStats(PrintStream outs)
     {
         Map<String, List<ColumnFamilyStoreMBean>> cfstoreMap = new HashMap<String, List<ColumnFamilyStoreMBean>>();




[jira] [Updated] (CASSANDRA-2844) grep friendly nodetool compactionstats output

2011-07-01 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-2844:
--

Affects Version/s: (was: 0.8.1)
   0.8.0
Fix Version/s: 0.8.2
 Assignee: Wojciech Meler

 grep friendly nodetool compactionstats output
 -

 Key: CASSANDRA-2844
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2844
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.8.0
Reporter: Wojciech Meler
Assignee: Wojciech Meler
Priority: Trivial
 Fix For: 0.8.2

 Attachments: comapctionstats.patch


 output from nodetool compactionstats is quite hard to parse with text tools - 
 it would be nice to have one line per compaction

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (CASSANDRA-2844) grep friendly nodetool compactionstats output

2011-07-01 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis resolved CASSANDRA-2844.
---

Resolution: Fixed
  Reviewer: jbellis

reformatted and committed.  thanks!

 grep friendly nodetool compactionstats output
 -

 Key: CASSANDRA-2844
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2844
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.8.0
Reporter: Wojciech Meler
Assignee: Wojciech Meler
Priority: Trivial
 Fix For: 0.8.2

 Attachments: comapctionstats.patch


 output from nodetool compactionstats is quite hard to parse with text tools - 
 it would be nice to have one line per compaction

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-803) remove PropertyConfigurator from CassandraDaemon

2011-07-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058751#comment-13058751
 ] 

Jonathan Ellis commented on CASSANDRA-803:
--

bq. I'd be happy to rip out all the log4j specific stuff and replace it with 
slf4j if that patch would be used.

Sure, as long as the log4j-based defaults continue to work.

Related: CASSANDRA-2383

 remove PropertyConfigurator from CassandraDaemon
 

 Key: CASSANDRA-803
 URL: https://issues.apache.org/jira/browse/CASSANDRA-803
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.6
Reporter: Jesse McConnell

 In order for users to make use of the EmbeddedCassandraService for unit 
 testing they need to have a dependency declared on log4j.  
 It would be nice if we could use the log4j-over-slf4j artifact to bridge this 
 requirement for those of us using slf4j.  
 http://www.slf4j.org/legacy.html#log4j-over-slf4j
 Currently it errors with the direct usage of the PropertyConfigurator in 
 o.a.c.thrift.CassandraDaemon.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2786) After a minor compaction, deleted key-slices are visible again

2011-07-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058755#comment-13058755
 ] 

Jonathan Ellis commented on CASSANDRA-2786:
---

Nit: wouldn't it be cleaner to just pass gcBefore, rather than the entire
controller, to the EchoedRow constructor?
+1 otherwise.
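
For illustration, the suggestion amounts to narrowing the constructor's dependency to the one value it needs; the sketch below uses simplified, hypothetical signatures rather than the actual patch code.

{code}
// Illustrative sketch of the review suggestion (hypothetical, simplified signatures).
class EchoedRow
{
    private final int gcBefore;

    // suggested: depend only on the value actually used
    EchoedRow(int gcBefore)
    {
        this.gcBefore = gcBefore;
    }

    // instead of: EchoedRow(CompactionController controller) { this.gcBefore = controller.gcBefore; }
}
{code}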

 After a minor compaction, deleted key-slices are visible again
 --

 Key: CASSANDRA-2786
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2786
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0
 Environment: Reproduced on single Cassandra node (CentOS 5.5)
 Reproduced on single Cassandra node (Windows Server 2008)
Reporter: rene kochen
Assignee: Sylvain Lebresne
 Fix For: 0.8.1, 0.8.2

 Attachments: 0001-Fix-wrong-purge-of-deleted-cf.patch, 
 2786_part2.patch, CassandraIssue.zip, CassandraIssueJava.zip


 After a minor compaction, deleted key-slices are visible again.
 Steps to reproduce:
 1) Insert a row named test.
 2) Insert 50 rows. During this step, row test is included in a major 
 compaction:
file-1, file-2, file-3 and file-4 compacted to file-5 (includes test).
 3) Delete row named test.
 4) Insert 50 rows. During this step, row test is included in a minor 
 compaction:
file-6, file-7, file-8 and file-9 compacted to file-10 (should include 
 tombstoned test).
 After step 4, row test is live again.
 Test environment:
 Single node with empty database.
 Standard configured super-column-family (I see this behavior with several
 gc_grace settings, big and small values):
 create column family Customers with column_type = 'Super' and comparator =
 'BytesType';
 In Cassandra 0.7.6 I observe the expected behavior, i.e. after step 4, the 
 row is still deleted.
 I've included a .NET program to reproduce the problem. I will add a Java 
 version later on.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2388) ColumnFamilyRecordReader fails for a given split because a host is down, even if records could reasonably be read from other replica.

2011-07-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058757#comment-13058757
 ] 

Jonathan Ellis commented on CASSANDRA-2388:
---

+1 to the CFRR changes.

It wasn't immediately clear to me what the CFIF changes are doing; can you elaborate?

 ColumnFamilyRecordReader fails for a given split because a host is down, even 
 if records could reasonably be read from other replica.
 -

 Key: CASSANDRA-2388
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2388
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop
Affects Versions: 0.7.6, 0.8.0
Reporter: Eldon Stegall
Assignee: Jeremy Hanna
  Labels: hadoop, inputformat
 Fix For: 0.7.7, 0.8.2

 Attachments: 0002_On_TException_try_next_split.patch, 
 CASSANDRA-2388-addition1.patch, CASSANDRA-2388-local-nodes-only.rough-sketch, 
 CASSANDRA-2388.patch, CASSANDRA-2388.patch, CASSANDRA-2388.patch


 ColumnFamilyRecordReader only tries the first location for a given split. We 
 should try multiple locations for a given split.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2844) grep friendly nodetool compactionstats output

2011-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058765#comment-13058765
 ] 

Hudson commented on CASSANDRA-2844:
---

Integrated in Cassandra-0.8 #201 (See 
[https://builds.apache.org/job/Cassandra-0.8/201/])
improve nodetool compactionstats formatting
patch by Wojciech Meler; reviewed by jbellis for CASSANDRA-2844

jbellis : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1142050
Files : 
* /cassandra/branches/cassandra-0.8/CHANGES.txt
* 
/cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/tools/NodeCmd.java


 grep friendly nodetool compactionstats output
 -

 Key: CASSANDRA-2844
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2844
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.8.0
Reporter: Wojciech Meler
Assignee: Wojciech Meler
Priority: Trivial
 Fix For: 0.8.2

 Attachments: comapctionstats.patch


 output from nodetool compactionstats is quite hard to parse with text tools - 
 it would be nice to have one line per compaction

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2753) Capture the max client timestamp for an SSTable

2011-07-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058767#comment-13058767
 ] 

Jonathan Ellis commented on CASSANDRA-2753:
---

Is there a reason not to have the max timestamp code in an IColumn method?

 Capture the max client timestamp for an SSTable
 ---

 Key: CASSANDRA-2753
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2753
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Alan Liang
Assignee: Alan Liang
Priority: Minor
 Attachments: 
 0001-capture-max-timestamp-and-created-SSTableMetadata-to-V2.patch, 
 0001-capture-max-timestamp-and-created-SSTableMetadata-to.patch, 
 0003-capture-max-timestamp-for-sstable-and-introduced-SST.patch




--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2804) expose dropped messages, exceptions over JMX

2011-07-01 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-2804:
--

Attachment: 2804-v2.txt

v2 adds the recently-dropped mbean as in Ryan's patch, and changes
logDroppedMessages to log the total counts so as not to interfere with it.
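
A hypothetical sketch of the kind of MBean surface being discussed; the interface and method names are illustrative and do not correspond to the attached patches.

{code}
// Illustrative only: expose per-verb dropped-message counts over JMX,
// both cumulative totals and counts since the last read.
import java.util.Map;

public interface DroppedMessagesMBean
{
    /** Total dropped messages per verb since the node started. */
    Map<String, Long> getDroppedMessages();

    /** Dropped messages per verb since the last time this attribute was read. */
    Map<String, Long> getRecentlyDroppedMessages();
}
{code}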

 expose dropped messages, exceptions over JMX
 

 Key: CASSANDRA-2804
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2804
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2804-v2.txt, 2804.txt, 
 twttr-cassandra-0.8-counts-resync-droppedmsg-metric.diff


 Patch against 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2804) expose dropped messages, exceptions over JMX

2011-07-01 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-2804:
--

Fix Version/s: (was: 0.7.7)

Targeting 0.8+ now since we're changing logDroppedMessages behavior.

 expose dropped messages, exceptions over JMX
 

 Key: CASSANDRA-2804
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2804
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2804-v2.txt, 2804.txt, 
 twttr-cassandra-0.8-counts-resync-droppedmsg-metric.diff


 Patch against 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2846) Changing replication_factor using update keyspace not working

2011-07-01 Thread Jon Hermes (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jon Hermes updated CASSANDRA-2846:
--

Reviewer: jhermes  (was: bcoverston)

 Changing replication_factor using update keyspace not working
 ---

 Key: CASSANDRA-2846
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2846
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.1
 Environment: A clean 0.8.1 install using the default configuration
Reporter: Jonas Borgström
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2846.txt


 Unless I've misunderstood the new way to do this with 0.8 I think update 
 keyspace is broken:
 {code}
 [default@unknown] create keyspace Test with placement_strategy = 
 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
 [{replication_factor:1}];
 37f70d40-a3e9-11e0--242d50cf1fbf
 Waiting for schema agreement...
 ... schemas agree across the cluster
 [default@unknown] describe keyspace Test;
 Keyspace: Test:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 [default@unknown] update keyspace Test with placement_strategy = 
 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
 [{replication_factor:2}];
 489fe220-a3e9-11e0--242d50cf1fbf
 Waiting for schema agreement...
 ... schemas agree across the cluster
 [default@unknown] describe keyspace Test; 
   
 Keyspace: Test:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 {code}
 Isn't the second describe keyspace supposed to say 
 replication_factor:2?
 Relevant bits from system.log:
 {code}
 Migration.java (line 116) Applying migration 
 489fe220-a3e9-11e0--242d50cf1fbf Update keyspace Testrep 
 strategy:SimpleStrategy{}durable_writes: true to Testrep 
 strategy:SimpleStrategy{}durable_writes: true
 UpdateKeyspace.java (line 74) Keyspace updated. Please perform any manual 
 operations
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2846) Changing replication_factor using update keyspace not working

2011-07-01 Thread Jon Hermes (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058855#comment-13058855
 ] 

Jon Hermes commented on CASSANDRA-2846:
---

-1, doesn't update strategy_options for KS's that already have SimpleStrategy.
repro:

{noformat}
start 1-node local
stress -o insert -n 1 (create Keyspace1 with SS and RF1)
cli:
  [] update keyspace Keyspace1 with strategy_options=[{replication_factor:2}];
{noformat}

Creating a new keyspace (Keyspace2, with default NTS and [{DC1:1}]) and then
running `update keyspace Keyspace2 with
placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and
strategy_options=[{replication_factor:2}];` does work, however.

 Changing replication_factor using update keyspace not working
 ---

 Key: CASSANDRA-2846
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2846
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.1
 Environment: A clean 0.8.1 install using the default configuration
Reporter: Jonas Borgström
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2846.txt


 Unless I've misunderstood the new way to do this with 0.8 I think update 
 keyspace is broken:
 {code}
 [default@unknown] create keyspace Test with placement_strategy = 
 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
 [{replication_factor:1}];
 37f70d40-a3e9-11e0--242d50cf1fbf
 Waiting for schema agreement...
 ... schemas agree across the cluster
 [default@unknown] describe keyspace Test;
 Keyspace: Test:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 [default@unknown] update keyspace Test with placement_strategy = 
 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
 [{replication_factor:2}];
 489fe220-a3e9-11e0--242d50cf1fbf
 Waiting for schema agreement...
 ... schemas agree across the cluster
 [default@unknown] describe keyspace Test; 
   
 Keyspace: Test:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 {code}
 Isn't the second describe keyspace supposed to to say 
 replication_factor:2?
 Relevant bits from system.log:
 {code}
 Migration.java (line 116) Applying migration 
 489fe220-a3e9-11e0--242d50cf1fbf Update keyspace Testrep 
 strategy:SimpleStrategy{}durable_writes: true to Testrep 
 strategy:SimpleStrategy{}durable_writes: true
 UpdateKeyspace.java (line 74) Keyspace updated. Please perform any manual 
 operations
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2846) Changing replication_factor using update keyspace not working

2011-07-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058856#comment-13058856
 ] 

Jonathan Ellis commented on CASSANDRA-2846:
---

Jonas's test case works for me.

 Changing replication_factor using update keyspace not working
 ---

 Key: CASSANDRA-2846
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2846
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.1
 Environment: A clean 0.8.1 install using the default configuration
Reporter: Jonas Borgström
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2846.txt


 Unless I've misunderstood the new way to do this with 0.8 I think update 
 keyspace is broken:
 {code}
 [default@unknown] create keyspace Test with placement_strategy = 
 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
 [{replication_factor:1}];
 37f70d40-a3e9-11e0--242d50cf1fbf
 Waiting for schema agreement...
 ... schemas agree across the cluster
 [default@unknown] describe keyspace Test;
 Keyspace: Test:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 [default@unknown] update keyspace Test with placement_strategy = 
 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
 [{replication_factor:2}];
 489fe220-a3e9-11e0--242d50cf1fbf
 Waiting for schema agreement...
 ... schemas agree across the cluster
 [default@unknown] describe keyspace Test; 
   
 Keyspace: Test:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 {code}
 Isn't the second describe keyspace supposed to to say 
 replication_factor:2?
 Relevant bits from system.log:
 {code}
 Migration.java (line 116) Applying migration 
 489fe220-a3e9-11e0--242d50cf1fbf Update keyspace Testrep 
 strategy:SimpleStrategy{}durable_writes: true to Testrep 
 strategy:SimpleStrategy{}durable_writes: true
 UpdateKeyspace.java (line 74) Keyspace updated. Please perform any manual 
 operations
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-1125) Filter out ColumnFamily rows that aren't part of the query

2011-07-01 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-1125:
--

Attachment: 1125-formatted.txt

Looks good to me for the most part.  (Attaching reformatted version.)

One part though I'm not 100% sure about -- we're using KeyRange for 
start-exclusive ranges, when the Thrift API always uses it for start-inclusive.

I'd be more comfortable with any of:
- using a Pair<String, String>
- using a new one-off class (see the sketch after this list)
- using KeyRange but with tokens (which Thrift also uses for start-exclusive)
- using a Range object directly (also requires tokens)
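
For example, the one-off class option could be as small as the following hypothetical holder, which makes the start-exclusive semantics explicit instead of overloading KeyRange; the class and field names are illustrative only.

{code}
// Hypothetical one-off holder; the field names carry the range semantics.
public final class ExclusiveStartKeyRange
{
    public final String startKeyExclusive; // rows strictly after this key
    public final String endKeyInclusive;   // up to and including this key

    public ExclusiveStartKeyRange(String startKeyExclusive, String endKeyInclusive)
    {
        this.startKeyExclusive = startKeyExclusive;
        this.endKeyInclusive = endKeyInclusive;
    }
}
{code}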

 Filter out ColumnFamily rows that aren't part of the query
 --

 Key: CASSANDRA-1125
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1125
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Reporter: Jeremy Hanna
Assignee: Mck SembWever
Priority: Minor
 Fix For: 1.0

 Attachments: 1125-formatted.txt, CASSANDRA-1125.patch


 Currently, when running a MapReduce job against data in a Cassandra data 
 store, it reads through all the data for a particular ColumnFamily.  This 
 could be optimized to only read through those rows that have to do with the 
 query.
 It's a small change but wanted to put it in Jira so that it didn't fall 
 through the cracks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1125) Filter out ColumnFamily rows that aren't part of the query

2011-07-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058859#comment-13058859
 ] 

Jonathan Ellis commented on CASSANDRA-1125:
---

(And I'd be fine with putting this in 0.8.x.)

 Filter out ColumnFamily rows that aren't part of the query
 --

 Key: CASSANDRA-1125
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1125
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Reporter: Jeremy Hanna
Assignee: Mck SembWever
Priority: Minor
 Fix For: 1.0

 Attachments: 1125-formatted.txt, CASSANDRA-1125.patch


 Currently, when running a MapReduce job against data in a Cassandra data 
 store, it reads through all the data for a particular ColumnFamily.  This 
 could be optimized to only read through those rows that have to do with the 
 query.
 It's a small change but wanted to put it in Jira so that it didn't fall 
 through the cracks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-2846) Changing replication_factor using update keyspace not working

2011-07-01 Thread Jon Hermes (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058855#comment-13058855
 ] 

Jon Hermes edited comment on CASSANDRA-2846 at 7/1/11 11:21 PM:


--1, doesn't update strategy_options for KS's that already have SimpleStrategy.-

+1, it's good.

  was (Author: jhermes):
-1, doesn't update strategy_options for KS's that already have 
SimpleStrategy.
repro:

{noformat}
start 1-node local
stress -o insert -n 1 (create Keyspace1 with SS and RF1)
cli:
  [] update keyspace Keyspace1 with strategy_options=[{replication_factor:2}];
{noformat}

Creating a new keyspace (Keyspace2, with default NTS and [{DC1:1}], then 
`update keyspace Keyspace2 with 
placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and 
strategy_options=[{replication_factor:2}];` does work, however.
  
 Changing replication_factor using update keyspace not working
 ---

 Key: CASSANDRA-2846
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2846
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.1
 Environment: A clean 0.8.1 install using the default configuration
Reporter: Jonas Borgström
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2846.txt


 Unless I've misunderstood the new way to do this with 0.8 I think update 
 keyspace is broken:
 {code}
 [default@unknown] create keyspace Test with placement_strategy = 
 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
 [{replication_factor:1}];
 37f70d40-a3e9-11e0--242d50cf1fbf
 Waiting for schema agreement...
 ... schemas agree across the cluster
 [default@unknown] describe keyspace Test;
 Keyspace: Test:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 [default@unknown] update keyspace Test with placement_strategy = 
 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
 [{replication_factor:2}];
 489fe220-a3e9-11e0--242d50cf1fbf
 Waiting for schema agreement...
 ... schemas agree across the cluster
 [default@unknown] describe keyspace Test; 
   
 Keyspace: Test:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 {code}
 Isn't the second describe keyspace supposed to to say 
 replication_factor:2?
 Relevant bits from system.log:
 {code}
 Migration.java (line 116) Applying migration 
 489fe220-a3e9-11e0--242d50cf1fbf Update keyspace Testrep 
 strategy:SimpleStrategy{}durable_writes: true to Testrep 
 strategy:SimpleStrategy{}durable_writes: true
 UpdateKeyspace.java (line 74) Keyspace updated. Please perform any manual 
 operations
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira