[jira] [Commented] (CASSANDRA-3003) Trunk single-pass streaming doesn't handle large row correctly
[ https://issues.apache.org/jira/browse/CASSANDRA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082991#comment-13082991 ] Sylvain Lebresne commented on CASSANDRA-3003: -

bq. I'm probably missing something, but isn't the problem that this can't be done without two passes for rows that are too large to fit in memory?

Hmm, true. What we need to do is deserialize each row with the 'fromRemote' flag on so that the deltas are cleaned up, and then reserialize the result. But that will potentially reduce the columns' serialized size (and thus change the row's total size and the column index). Now, we could imagine remembering the offset of the beginning of the row, loading the column index into memory and updating it during the first pass (it would likely be OK to simply update the index offsets without changing the index structure itself), then seeking back at the end to write the updated data size and column index. However, this unfortunately won't be doable with the current SequentialWriter (and CompressedSequentialWriter), since we cannot seek back (without truncating).

In retrospect, it would have been nicer to have the cleaning of a counter context not change its size :( So yeah, it sucks. I'm still only mildly a fan of moving the cleanup, because it feels wrong somehow; it feels like it would be better to have that delta cleaning done sooner rather than later. But this may end up being the simplest/most efficient solution.

Trunk single-pass streaming doesn't handle large row correctly
--
Key: CASSANDRA-3003 URL: https://issues.apache.org/jira/browse/CASSANDRA-3003 Project: Cassandra Issue Type: Bug Components: Core Reporter: Sylvain Lebresne Assignee: Yuki Morishita Priority: Critical Labels: streaming

For normal column families, trunk streaming always buffers the whole row into memory. It uses {noformat} ColumnFamily.serializer().deserializeColumns(in, cf, true, true); {noformat} on the input bytes.
We must avoid this for rows that don't fit in the inMemoryLimit. Note that for regular column families, for a given row, there is actually no need to even recreate the bloom filter or column index, nor to deserialize the columns. It is enough to read the key and row size to feed the index writer, and then simply dump the rest to disk directly. This would make streaming more efficient, avoid a lot of object creation, and avoid the pitfall of big rows. Counter column families are unfortunately trickier, because each column needs to be deserialized (to mark it as 'fromRemote'). However, we don't need to do the double pass of LazilyCompactedRow for that. We can simply use an SSTableIdentityIterator and deserialize/reserialize input as it comes. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
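The "deserialize/reserialize input as it comes" approach can be sketched with plain Java streams. This is a minimal illustration, not Cassandra's actual serialization format: the length-prefixed framing and all names here are assumptions.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SinglePassEcho {
    // Copies length-prefixed "columns" from in to out one at a time,
    // so the whole row is never buffered in memory. The framing is
    // illustrative, not Cassandra's actual wire format.
    static void echoColumns(DataInputStream in, DataOutputStream out, int columnCount)
            throws IOException {
        for (int i = 0; i < columnCount; i++) {
            int len = in.readInt();
            byte[] column = new byte[len];
            in.readFully(column);
            // ...here a counter column would be deserialized, marked
            // 'fromRemote', and reserialized (possibly changing its length)...
            out.writeInt(column.length);
            out.write(column);
        }
    }
}
```

Only one column is resident at a time, which is the property the ticket asks for; the catch discussed above is that a cleanup that shrinks a column invalidates the already-written row size and column index.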
[jira] [Commented] (CASSANDRA-3014) AntiEntropy/MerkleTree Error
[ https://issues.apache.org/jira/browse/CASSANDRA-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083002#comment-13083002 ] Sylvain Lebresne commented on CASSANDRA-3014: -

Yeah, there is a real chance this has been fixed since 0.8.0. This seems to be a problem of the merkle tree being deeper than the max depth limit, so it is likely a problem during the creation of the tree that creates a deeper tree than allowed (possibly by just 1). However, since 0.8.0, we have both fixed a related bug in the creation function (CASSANDRA-2758) and changed the creation function used for RandomPartitioner (CASSANDRA-2841).

AntiEntropy/MerkleTree Error
Key: CASSANDRA-3014 URL: https://issues.apache.org/jira/browse/CASSANDRA-3014 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 0.8.0 Reporter: Hefeng Yuan Priority: Minor

Hi, we are seeing some AntiEntropy errors on our production servers; since it's hard to reproduce in another environment, we're pasting the stack trace here for clues. We're using a cluster of 2 data centers, 9 nodes: 6 for online traffic, 3 for Brisk BI. Our RF is Cassandra: 5, Brisk: 1. Our Cassandra server version is 0.8.0. Data is written using Hector 0.7.0-20 and the cassandra 0.7.2 library. Partitioner is random. Our nodetool repair is scheduled once per week. The exception stack is appended at the end. Any help is appreciated.
Thanks, Hefeng

ERROR [AntiEntropyStage:2] 2011-08-08 04:24:39,556 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[AntiEntropyStage:2,5,main]
java.lang.AssertionError
        at org.apache.cassandra.utils.MerkleTree.inc(MerkleTree.java:154)
        at org.apache.cassandra.utils.MerkleTree.differenceHelper(MerkleTree.java:262)
        at org.apache.cassandra.utils.MerkleTree.differenceHelper(MerkleTree.java:273)
        at org.apache.cassandra.utils.MerkleTree.differenceHelper(MerkleTree.java:284)
        ... dozens more differenceHelper frames alternating between MerkleTree.java:273 and MerkleTree.java:284 (recursive descent; remainder of the truncated trace elided)
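The failing assertion is MerkleTree.inc() rejecting a node deeper than the tree's depth limit. A minimal sketch of depth-limited recursive range splitting (hypothetical names, not the real MerkleTree code), where the same kind of assertion guards against a builder that splits one level too far:

```java
public class DepthLimitedSplit {
    // Recursively split [left, right) until a leaf covers a single unit
    // or maxDepth is reached. The assertion mirrors the MerkleTree.inc()
    // invariant: a correct builder must never produce a node deeper than
    // maxDepth; an off-by-one in the split condition trips it.
    static int deepestLeaf(long left, long right, int depth, int maxDepth) {
        assert depth <= maxDepth;
        if (depth == maxDepth || right - left <= 1)
            return depth;
        long mid = left + (right - left) / 2;
        return Math.max(deepestLeaf(left, mid, depth + 1, maxDepth),
                        deepestLeaf(mid, right, depth + 1, maxDepth));
    }
}
```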
[jira] [Commented] (CASSANDRA-2982) Refactor secondary index api
[ https://issues.apache.org/jira/browse/CASSANDRA-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083065#comment-13083065 ] Hudson commented on CASSANDRA-2982: ---

Integrated in Cassandra #1016 (See [https://builds.apache.org/job/Cassandra/1016/])
Refactoring of the secondary index api; patch by Jake Luciani, reviewed by Pavel Yaskevich for CASSANDRA-2982
xedin: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156567
Files:
* /cassandra/trunk/src/java/org/apache/cassandra/db/MeteredFlusher.java
* /cassandra/trunk/src/java/org/apache/cassandra/db/index/SecondaryIndexManager.java
* /cassandra/trunk/src/java/org/apache/cassandra/db/Table.java
* /cassandra/trunk/src/java/org/apache/cassandra/cql/QueryProcessor.java
* /cassandra/trunk/test/unit/org/apache/cassandra/db/ColumnFamilyStoreTest.java
* /cassandra/trunk/src/java/org/apache/cassandra/db/index/keys/KeysIndex.java
* /cassandra/trunk/test/unit/org/apache/cassandra/db/CleanupTest.java
* /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/CompactionManager.java
* /cassandra/trunk/src/java/org/apache/cassandra/db/index
* /cassandra/trunk/src/java/org/apache/cassandra/db/index/keys/KeysSearcher.java
* /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
* /cassandra/trunk/src/java/org/apache/cassandra/service/IndexScanVerbHandler.java
* /cassandra/trunk/src/java/org/apache/cassandra/db/index/SecondaryIndex.java
* /cassandra/trunk/src/java/org/apache/cassandra/thrift/ThriftValidation.java
* /cassandra/trunk/CHANGES.txt
* /cassandra/trunk/src/java/org/apache/cassandra/db/index/SecondaryIndexBuilder.java
* /cassandra/trunk/src/java/org/apache/cassandra/db/index/keys
* /cassandra/trunk/src/java/org/apache/cassandra/streaming/StreamInSession.java
* /cassandra/trunk/src/java/org/apache/cassandra/db/index/SecondaryIndexSearcher.java
* /cassandra/trunk/test/unit/org/apache/cassandra/streaming/StreamingTransferTest.java
* /cassandra/trunk/test/unit/org/apache/cassandra/db/DefsTest.java

Refactor secondary index api
--
Key: CASSANDRA-2982 URL: https://issues.apache.org/jira/browse/CASSANDRA-2982 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: T Jake Luciani Assignee: T Jake Luciani Fix For: 1.0 Attachments: 2982-v1.txt, 2982-v2.txt, CASSANDRA-2982-v3.patch

Secondary indexes currently make some bad assumptions about the underlying indexes:
1. That they are always stored in other column families.
2. That there is a unique index per column.
In the case of CASSANDRA-2915, neither of these is true. The new api should abstract the search concepts and allow any search api to plug in. Once the code is refactored and basically pluggable, we can remove the IndexType enum and use class names, similar to how we handle partitioners and comparators. The basic api is to add a SecondaryIndexManager that handles different index types per CF, and a SecondaryIndex base class that handles a particular type implementation. This requires major changes to ColumnFamilyStore and Table.IndexBuilder.
[jira] [Commented] (CASSANDRA-2975) Upgrade MurmurHash to version 3
[ https://issues.apache.org/jira/browse/CASSANDRA-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083077#comment-13083077 ] Pavel Yaskevich commented on CASSANDRA-2975: -

First of all, can you please rebase both against latest trunk and attach them to JIRA? What I see from a first look (about the patch for backward compatibility):
- I think we should extract an interface from the BloomFilter class and make BF a factory, as we now have Murmur{2,3}BloomFilter classes
- Needs a test for compatibility with old SSTables (which use Murmur2BF)
- Minor note: the comment about the new SSTable version is missing at the top of the Descriptor class
As soon as you attach the files here, I will apply and play with them and maybe find other problems.

Upgrade MurmurHash to version 3
---
Key: CASSANDRA-2975 URL: https://issues.apache.org/jira/browse/CASSANDRA-2975 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Brian Lindauer Assignee: Brian Lindauer Priority: Trivial Labels: lhf Fix For: 1.0

MurmurHash version 3 was finalized on June 3. It provides an enormous speedup and increased robustness over version 2, which is implemented in Cassandra. Information here: http://code.google.com/p/smhasher/ The reference implementation is here: http://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp?spec=svn136&r=136 I have already done the work to port the (public domain) reference implementation to Java in the MurmurHash class and updated the BloomFilter class to use the new implementation: https://github.com/lindauer/cassandra/commit/cea6068a4a3e5d7d9509335394f9ef3350d37e93 Apart from the faster hash time, the new version only requires one call to hash() rather than 2, since it returns 128 bits of hash instead of 64.
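For reference, the 64-bit finalization ("avalanche") mix from the public-domain MurmurHash3 reference implementation, which any Java port such as the one linked above would include; this step is what gives version 3 its improved bit mixing:

```java
public class Murmur3Fmix {
    // fmix64 from the MurmurHash3 reference implementation (public domain):
    // the finalization mix applied to each 64-bit half of the 128-bit result.
    static long fmix64(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }
}
```

Note fmix64(0) == 0, a known (and harmless) fixed point of the mix.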
[jira] [Assigned] (CASSANDRA-2915) Lucene based Secondary Indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] T Jake Luciani reassigned CASSANDRA-2915: - Assignee: Jason Rutherglen

Lucene based Secondary Indexes
--
Key: CASSANDRA-2915 URL: https://issues.apache.org/jira/browse/CASSANDRA-2915 Project: Cassandra Issue Type: New Feature Components: Core Reporter: T Jake Luciani Assignee: Jason Rutherglen Labels: secondary_index Fix For: 1.0

Secondary indexes (of type KEYS) suffer from a number of limitations in their current form:
- Multiple IndexClauses only work when there is a subset of rows under the highest clause
- One new column family is created per index; this means 10 new CFs for 10 secondary indexes
This ticket will use the Lucene library to implement secondary indexes as one index per CF, and utilize the Lucene query engine to handle multiple index clauses. Also, by using Lucene we get a highly optimized file format. There are a few parallels we can draw between Cassandra and Lucene. Lucene indexes segments in memory then flushes them to disk, so we can sync our memtable flushes to Lucene flushes. Lucene also has optimize(), which correlates to our compaction process, so these can be sync'd as well. We will also need to correlate column validators to Lucene tokenizers so the data can be stored properly; the big win is that once this is done we can perform complex queries within a column, like wildcard searches. The downside of this approach is that we will need to read before write, since documents in Lucene are written as complete documents. For random workloads with lots of indexed columns, this means we need to read the document from the index, update it and write it back.
[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083099#comment-13083099 ] T Jake Luciani commented on CASSANDRA-2915: --- Under the CF dir I imagine
[jira] [Updated] (CASSANDRA-2950) Data from truncated CF reappears after server restart
[ https://issues.apache.org/jira/browse/CASSANDRA-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Lebresne updated CASSANDRA-2950: Attachment: 2950-v3_0.8.patch

Attaching a v3 that is rebased against 0.8. I've also slightly changed the logic in Truncate to submit all the flushes and then call waitForActiveFlushes, as this is slightly simpler and should work equally well as far as I can tell. Apart from that, this lgtm.

Data from truncated CF reappears after server restart
-
Key: CASSANDRA-2950 URL: https://issues.apache.org/jira/browse/CASSANDRA-2950 Project: Cassandra Issue Type: Bug Affects Versions: 0.8.0 Reporter: Cathy Daw Assignee: Jonathan Ellis Fix For: 0.8.5 Attachments: 2950-v2.txt, 2950-v3_0.8.patch, 2950.txt

* Configure 3 node cluster
* Ensure the java stress tool creates Keyspace1 with RF=3
{code}
// Run stress tool to generate 50 keys, 20 columns each
stress --operation=INSERT -t 2 --num-keys=50 --columns=20 --consistency-level=QUORUM --average-size-values --replication-factor=3 --create-index=KEYS --nodes=cathy1,cathy2
// Verify 50 keys in CLI
use Keyspace1; list Standard1;
// TRUNCATE CF in CLI
use Keyspace1; truncate counter1; list counter1;
// Run stress tool and verify creation of 1 key with 10 columns
stress --operation=INSERT -t 2 --num-keys=1 --columns=10 --consistency-level=QUORUM --average-size-values --replication-factor=3 --create-index=KEYS --nodes=cathy1,cathy2
// Verify 1 key in CLI
use Keyspace1; list Standard1;
// Restart all three nodes
// You will see 51 keys in CLI
use Keyspace1; list Standard1;
{code}
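The "submit all the flushes, then wait" ordering Sylvain describes can be sketched with standard java.util.concurrent primitives (names are illustrative, not the actual Truncate code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class FlushThenWait {
    // Submit every flush before waiting on any of them, so the flushes run
    // concurrently; waiting one-by-one as we submit would serialize them.
    static void flushAllThenWait(ExecutorService pool, List<Runnable> flushes)
            throws InterruptedException, ExecutionException {
        List<Future<?>> pending = new ArrayList<>();
        for (Runnable flush : flushes)
            pending.add(pool.submit(flush));   // start them all first
        for (Future<?> f : pending)
            f.get();                           // then wait for each to finish
    }
}
```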
[jira] [Commented] (CASSANDRA-3003) Trunk single-pass streaming doesn't handle large row correctly
[ https://issues.apache.org/jira/browse/CASSANDRA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083111#comment-13083111 ] Jonathan Ellis commented on CASSANDRA-3003: ---

bq. it would have been nicer to have the cleaning of a counter context not change its size

Can we pad it somehow?
[jira] [Commented] (CASSANDRA-2405) should expose 'time since last successful repair' for easier aes monitoring
[ https://issues.apache.org/jira/browse/CASSANDRA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083116#comment-13083116 ] Pavel Yaskevich commented on CASSANDRA-2405: - Sure.

should expose 'time since last successful repair' for easier aes monitoring
---
Key: CASSANDRA-2405 URL: https://issues.apache.org/jira/browse/CASSANDRA-2405 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Peter Schuller Assignee: Pavel Yaskevich Priority: Minor Fix For: 0.8.5 Attachments: CASSANDRA-2405-v2.patch, CASSANDRA-2405-v3.patch, CASSANDRA-2405-v4.patch, CASSANDRA-2405.patch

The practical implementation issues of actually ensuring repair runs are somewhat of an undocumented/untreated subject. One hopefully low-hanging fruit would be to at least expose the time since the last successful repair for a particular column family, to make it easier to write a correct script to monitor for lack of repair in a non-buggy fashion.
[jira] [Commented] (CASSANDRA-1473) Implement a Cassandra aware Hadoop mapreduce.Partitioner
[ https://issues.apache.org/jira/browse/CASSANDRA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083117#comment-13083117 ] Jonathan Ellis commented on CASSANDRA-1473: ---

{noformat}
/** Get the partition number for a given key (hence record) given the total number of partitions */
getPartition(KEY key, VALUE value, int numPartitions)
{noformat}

I don't think we can do this without knowing the total number of keys as well. :(

Implement a Cassandra aware Hadoop mapreduce.Partitioner
--
Key: CASSANDRA-1473 URL: https://issues.apache.org/jira/browse/CASSANDRA-1473 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Stu Hood Assignee: Patricio Echague Fix For: 1.0

When using an IPartitioner that does not sort data in byte order (RandomPartitioner, for example) with Cassandra's Hadoop integration, Hadoop is unaware of the output order of the data. We can make Hadoop aware of the proper order of the output data by implementing Hadoop's mapreduce.Partitioner interface: then Hadoop will handle sorting all of the data according to Cassandra's IPartitioner, and the writing clients will be able to connect to smaller numbers of Cassandra nodes.
[jira] [Issue Comment Edited] (CASSANDRA-1473) Implement a Cassandra aware Hadoop mapreduce.Partitioner
[ https://issues.apache.org/jira/browse/CASSANDRA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083117#comment-13083117 ] Jonathan Ellis edited comment on CASSANDRA-1473 at 8/11/11 1:54 PM:

{noformat}
/** Get the partition number for a given key (hence record) given the total number of partitions */
getPartition(KEY key, VALUE value, int numPartitions)
{noformat}

I don't think we can do this without knowing the total number of keys as well. :(

was (Author: jbellis):
{noformat}
/** Get the partition number for a given key (hence record) given the total number of partitions */
getPartition(KEY key, VALUE value, int numPartitions) */
{noformat}

I don't think we can do this, without knowing the total number of keys as well. :(
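Jonathan's objection is that a Hadoop-style getPartition(key, value, numPartitions) must be computable per record, which means bucketing by token range rather than by key count. A hedged sketch of the range-bucket idea, assuming 64-bit signed tokens and equal-width ranges (illustrative only; real token spaces and partitioners differ):

```java
public class TokenRangePartitioner {
    // Assign a token to one of numPartitions contiguous, equal-width ranges
    // of the signed 64-bit token space. This needs only the token itself,
    // not the total key count -- but it balances load only when tokens are
    // uniformly distributed (as with RandomPartitioner).
    static int getPartition(long token, int numPartitions) {
        // normalize the signed token into [0, 1)
        double normalized = (token / 2.0 / Long.MAX_VALUE) + 0.5;
        int bucket = (int) (normalized * numPartitions);
        return Math.min(bucket, numPartitions - 1); // clamp the top edge
    }
}
```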
[jira] [Resolved] (CASSANDRA-2045) Simplify HH to decrease read load when nodes come back
[ https://issues.apache.org/jira/browse/CASSANDRA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis resolved CASSANDRA-2045. --- Resolution: Fixed

Simplify HH to decrease read load when nodes come back
--
Key: CASSANDRA-2045 URL: https://issues.apache.org/jira/browse/CASSANDRA-2045 Project: Cassandra Issue Type: Improvement Reporter: Chris Goffinet Assignee: Nicholas Telford Fix For: 1.0 Attachments: 0001-Changed-storage-of-Hints-to-store-a-serialized-RowMu.patch, 0002-Refactored-HintedHandoffManager.sendRow-to-reduce-co.patch, 0003-Fixed-some-coding-style-issues.patch, 0004-Fixed-direct-usage-of-Gossiper.getEndpointStateForEn.patch, 0005-Removed-duplicate-failure-detection-conditionals.-It.patch, 0006-Removed-handling-of-old-style-hints.patch, 2045-v3.txt, 2045-v5.txt, 2045-v6.txt, CASSANDRA-2045-simplify-hinted-handoff-001.diff, CASSANDRA-2045-simplify-hinted-handoff-002.diff, CASSANDRA-2045-v4.diff

Currently when HH is enabled, hints are stored, and when a node comes back, we begin sending that node data. We do a lookup on the local node for the row to send. To help reduce read load (if a node is offline for a long period of time) we should store the data we want to forward to the node locally instead. We wouldn't have to do any lookups, just take the byte[] and send it to the destination.
[jira] [Commented] (CASSANDRA-2220) It would be nice to be able to rollback to a specific schema version.
[ https://issues.apache.org/jira/browse/CASSANDRA-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083150#comment-13083150 ] Jonathan Ellis commented on CASSANDRA-2220: --- Doesn't this conflict with CASSANDRA-2056? It would be nice to be able to rollback to a specific schema version. - Key: CASSANDRA-2220 URL: https://issues.apache.org/jira/browse/CASSANDRA-2220 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Gary Dusbabek Assignee: Gary Dusbabek Priority: Minor Labels: ponies Fix For: 1.0
[jira] [Commented] (CASSANDRA-2849) InvalidRequestException when validating column data includes entire column value
[ https://issues.apache.org/jira/browse/CASSANDRA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083169#comment-13083169 ] David Allsopp commented on CASSANDRA-2849: -- Yes, that looks good to me, thanks.

InvalidRequestException when validating column data includes entire column value
--
Key: CASSANDRA-2849 URL: https://issues.apache.org/jira/browse/CASSANDRA-2849 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 0.8.1 Reporter: David Allsopp Priority: Minor Fix For: 0.8.5 Attachments: 2849-v2.txt, cassandra-2849.diff

If the column value fails to validate, then ThriftValidation.validateColumnData() calls bytesToHex() on the entire column value and puts this string in the Exception. Since the value may be up to 2 GB, this is potentially a lot of extra memory. The value is likely to be logged (and presumably returned to the thrift client over the network?). This could cause a lot of slowdown or an unnecessary OOM crash, and is unlikely to be useful (the client has access to the full value anyway if required for debugging). Also, the reason for the exception (extracted from the MarshalException) is printed _after_ the data, so if there's any truncation in the logging system at any point, the reason will be lost. The reason should be displayed before the column value, and the column value should be truncated in the Exception message.
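The proposed fix (reason first, value hex-truncated) could look like this sketch; the names and the truncation threshold are hypothetical, not the actual ThriftValidation code:

```java
public class SafeValidationError {
    // Build a validation message with the reason first and the offending
    // value hex-encoded but truncated, so a multi-GB value never lands in
    // a log line or a thrift response.
    static String message(String reason, byte[] value, int maxBytes) {
        StringBuilder sb = new StringBuilder(reason).append(" (value: ");
        int shown = Math.min(value.length, maxBytes);
        for (int i = 0; i < shown; i++)
            sb.append(String.format("%02x", value[i]));
        if (value.length > maxBytes)
            sb.append("... [").append(value.length).append(" bytes total]");
        return sb.append(')').toString();
    }
}
```

Putting the reason before the value also addresses the second complaint: even if a logging layer truncates the line, the reason survives.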
[jira] [Commented] (CASSANDRA-1827) Batching across stages
[ https://issues.apache.org/jira/browse/CASSANDRA-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083206#comment-13083206 ] Peter Schuller commented on CASSANDRA-1827: ---

The following is speculation and should not be construed as a strong claim ;) I don't have proof, but I suspect that putting/popping individual entries is fairly significant, mostly because of the context switching involved and the interaction with the scheduler. Empirically, when stress-testing, it seems to me that under CPU-bound workloads the stages are often not saturated (and you certainly don't saturate all cores despite high concurrency). In other words, even with a consistent backlog of pending tasks, the active tasks aren't consistently at full concurrency. I wonder what kind of effects come from the delay in waking up a thread. Something similar to consider might be to try a non-blocking queue (you'd still presumably block in low-throughput cases, but have the potential to proceed lock-free under high load).

Batching across stages
--
Key: CASSANDRA-1827 URL: https://issues.apache.org/jira/browse/CASSANDRA-1827 Project: Cassandra Issue Type: Improvement Reporter: Chris Goffinet Fix For: 1.0

We might be able to get some improvement if we start batching tasks for every stage.
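The batching idea combined with a non-blocking queue can be sketched as a drain loop that pops many tasks per wakeup instead of one (illustrative only, not Cassandra's stage code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class StageBatcher {
    // Drain up to maxBatch tasks in one pass. Compared with popping one task
    // per loop iteration on a blocking queue, a worker that wakes up pays the
    // scheduling cost once per batch rather than once per task, and poll()
    // on a ConcurrentLinkedQueue never blocks.
    static List<Runnable> drainBatch(ConcurrentLinkedQueue<Runnable> queue, int maxBatch) {
        List<Runnable> batch = new ArrayList<>(maxBatch);
        Runnable task;
        while (batch.size() < maxBatch && (task = queue.poll()) != null)
            batch.add(task);
        return batch;
    }
}
```

A real stage would still need a parking strategy for the empty-queue case, which is exactly the "you'd still presumably block in low-throughput cases" caveat above.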
[jira] [Updated] (CASSANDRA-3001) Make the compression algorithm and chunk length configurable
[ https://issues.apache.org/jira/browse/CASSANDRA-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3001: -- Reviewer: xedin Make the compression algorithm and chunk length configurable Key: CASSANDRA-3001 URL: https://issues.apache.org/jira/browse/CASSANDRA-3001 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Sylvain Lebresne Assignee: Sylvain Lebresne Priority: Minor Labels: compression Fix For: 1.0 Attachments: 0001-Pluggable-algorithm-and-chunk-length.patch, 0002-Add-deflate-compressor.patch
[jira] [Resolved] (CASSANDRA-2985) batch_mutate silently failing
[ https://issues.apache.org/jira/browse/CASSANDRA-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis resolved CASSANDRA-2985. --- Resolution: Cannot Reproduce batch_mutate silently failing - Key: CASSANDRA-2985 URL: https://issues.apache.org/jira/browse/CASSANDRA-2985 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 0.8.1 Environment: OS:ubuntu-10;php; Reporter: copin teng I am working with Cassandra 0.8.1 using the thrift interface. I am trying to use the batch_mutate method call, however, when I execute it, I receive no error message. This leads me to believe it worked. When I check using the CLI, there is nothing there.
[jira] [Updated] (CASSANDRA-2900) Allow bulk load via streaming to complete 'transactionally'
[ https://issues.apache.org/jira/browse/CASSANDRA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-2900: -- Priority: Minor (was: Major) Fix Version/s: (was: 1.0) Allow bulk load via streaming to complete 'transactionally' --- Key: CASSANDRA-2900 URL: https://issues.apache.org/jira/browse/CASSANDRA-2900 Project: Cassandra Issue Type: New Feature Components: Tools Reporter: Stu Hood Priority: Minor It would be great if the bulk loader had the ability to cancel any streams that had completed if other streams fail, to narrow the window where a job could be half-completed. This is mostly only a problem for counter bulk loads or cases where the bulk load timestamps do not allow for idempotency. One possible way to implement this would be to have streaming (optionally?) add streamed files to a pool out-of-reach of compaction, which would allow you to cancel or finalize them depending on the success of the entire bulk load.
[jira] [Commented] (CASSANDRA-2889) Avoids having replicate on write tasks stacking up at CL.ONE
[ https://issues.apache.org/jira/browse/CASSANDRA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083228#comment-13083228 ] Jonathan Ellis commented on CASSANDRA-2889: --- Is this the same as CASSANDRA-2892? Avoids having replicate on write tasks stacking up at CL.ONE Key: CASSANDRA-2889 URL: https://issues.apache.org/jira/browse/CASSANDRA-2889 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.8.0 Reporter: Sylvain Lebresne Assignee: Sylvain Lebresne Labels: counters The counter design involves a read on the first replica during a write. At CL.ONE, this read is not involved in the latency of the operation (the write is acknowledged before). This means it is fairly easy to insert too quickly at CL.ONE and have the replicate on write tasks falling behind. The goal of this ticket is to protect against that. An option could be to bound the replicate on write task queue so that writes start to block once we have too many of those in the queue. Another option could be to drop the oldest tasks when they are too old, but it's probably a less safe option. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-2893) Add row-level isolation
[ https://issues.apache.org/jira/browse/CASSANDRA-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-2893: -- Description: This could be done using the atomic ConcurrentMap operations from the Memtable and something like http://code.google.com/p/pcollections/ to replace the ConcurrentSkipListMap in ThreadSafeSortedColumns. The trick is that pcollections does not provide a SortedMap, so we probably need to write our own. Googling [persistent sortedmap] I found http://code.google.com/p/actord/source/browse/trunk/actord/src/main/scala/ff/collection (in scala) and http://clojure.org/data_structures#Data Structures-Maps. was:This could be done using something like an AtomicReference and something like http://code.google.com/p/pcollections/ but with a SortedMap (for the columnfamily.columns collection). So we probably need to write our own. Add row-level isolation --- Key: CASSANDRA-2893 URL: https://issues.apache.org/jira/browse/CASSANDRA-2893 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Priority: Minor This could be done using the atomic ConcurrentMap operations from the Memtable and something like http://code.google.com/p/pcollections/ to replace the ConcurrentSkipListMap in ThreadSafeSortedColumns. The trick is that pcollections does not provide a SortedMap, so we probably need to write our own. Googling [persistent sortedmap] I found http://code.google.com/p/actord/source/browse/trunk/actord/src/main/scala/ff/collection (in scala) and http://clojure.org/data_structures#Data Structures-Maps. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
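The atomic-swap idea above can be sketched without a persistent-map library by treating each TreeMap as an immutable snapshot and replacing it wholesale with compare-and-set. The class and method names below are illustrative, not Cassandra's API; a real persistent SortedMap would share structure between versions instead of copying the whole map on every write.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: emulates a persistent sorted map by never mutating a
// published TreeMap, only replacing it atomically via compare-and-set.
public class IsolatedRow
{
    private final AtomicReference<TreeMap<String, String>> columns =
        new AtomicReference<TreeMap<String, String>>(new TreeMap<String, String>());

    // All updates in the batch become visible atomically: a concurrent reader
    // sees either none or all of them -- the row-level isolation in question.
    public void addAll(Map<String, String> updates)
    {
        while (true)
        {
            TreeMap<String, String> current = columns.get();
            TreeMap<String, String> next = new TreeMap<String, String>(current);
            next.putAll(updates);
            if (columns.compareAndSet(current, next))
                return;
        }
    }

    public SortedMap<String, String> snapshot()
    {
        return Collections.unmodifiableSortedMap(columns.get());
    }
}
```

The obvious cost is an O(n) copy per batch, which is exactly why the ticket looks for a structure-sharing SortedMap rather than this naive copy-on-write.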
svn commit: r1156688 - /cassandra/tags/cassandra-0.8.4/
Author: slebresne Date: Thu Aug 11 17:23:22 2011 New Revision: 1156688 URL: http://svn.apache.org/viewvc?rev=1156688view=rev Log: Creating tag for 0.8.4 Added: cassandra/tags/cassandra-0.8.4/ (props changed) - copied from r1156253, cassandra/branches/cassandra-0.8/ Propchange: cassandra/tags/cassandra-0.8.4/ -- --- svn:ignore (added) +++ svn:ignore Thu Aug 11 17:23:22 2011 @@ -0,0 +1,8 @@ +.classpath +.project +.settings +temp-testng-customsuite.xml +build +build.properties +.idea +out Propchange: cassandra/tags/cassandra-0.8.4/ -- --- svn:mergeinfo (added) +++ svn:mergeinfo Thu Aug 11 17:23:22 2011 @@ -0,0 +1,12 @@ +/cassandra/branches/cassandra-0.6:922689-1052356,1052358-1053452,1053454,1053456-1131291 +/cassandra/branches/cassandra-0.7:1026516-1151306 +/cassandra/branches/cassandra-0.7.0:1053690-1055654 +/cassandra/branches/cassandra-0.8:1090934-1125013,1125041 +/cassandra/branches/cassandra-0.8.0:1125021-1130369 +/cassandra/tags/cassandra-0.7.0-rc3:1051699-1053689 +/cassandra/tags/cassandra-0.8.0-rc1:1102511-1125020 +/cassandra/trunk:1129049-1129050,1129065,1151625,1152416,1153189 +/incubator/cassandra/branches/cassandra-0.3:774578-796573 +/incubator/cassandra/branches/cassandra-0.4:810145-834239,834349-834350 +/incubator/cassandra/branches/cassandra-0.5:72-915439 +/incubator/cassandra/branches/cassandra-0.6:911237-922688
svn commit: r1156694 - /cassandra/branches/cassandra-0.8/CHANGES.txt
Author: jbellis Date: Thu Aug 11 17:26:57 2011 New Revision: 1156694 URL: http://svn.apache.org/viewvc?rev=1156694view=rev Log: update CHANGES Modified: cassandra/branches/cassandra-0.8/CHANGES.txt Modified: cassandra/branches/cassandra-0.8/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/CHANGES.txt?rev=1156694r1=1156693r2=1156694view=diff == --- cassandra/branches/cassandra-0.8/CHANGES.txt (original) +++ cassandra/branches/cassandra-0.8/CHANGES.txt Thu Aug 11 17:26:57 2011 @@ -27,8 +27,8 @@ (CASSANDRA-2958) * improved POSIX compatibility of start scripts (CASsANDRA-2965) * add counter support to Hadoop InputFormat (CASSANDRA-2981) - * fix bug where dirty commit logs were removed (and avoid keeping segments - with no post-flush activity permanently dirty) (CASSANDRA-2829) + * fix bug where dirty commitlog segments were removed (and avoid keeping + segments with no post-flush activity permanently dirty) (CASSANDRA-2829) * fix throwing exception with batch mutation of counter super columns (CASSANDRA-2949) * ignore system tables during repair (CASSANDRA-2979)
svn commit: r1156695 - /cassandra/branches/cassandra-0.8/CHANGES.txt
Author: jbellis Date: Thu Aug 11 17:27:43 2011 New Revision: 1156695 URL: http://svn.apache.org/viewvc?rev=1156695view=rev Log: update CHANGES Modified: cassandra/branches/cassandra-0.8/CHANGES.txt Modified: cassandra/branches/cassandra-0.8/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/CHANGES.txt?rev=1156695r1=1156694r2=1156695view=diff == --- cassandra/branches/cassandra-0.8/CHANGES.txt (original) +++ cassandra/branches/cassandra-0.8/CHANGES.txt Thu Aug 11 17:27:43 2011 @@ -773,7 +773,7 @@ transmitting separately (CASSANDRA-1465) * apply reversed flag during collation from different data sources (CASSANDRA-1450) - * make failure to remove comitlog segment non-fatal (CASSANDRA-1348) + * make failure to remove commitlog segment non-fatal (CASSANDRA-1348) * correct ordering of drain operations so CL.recover is no longer necessary (CASSANDRA-1408) * removed keyspace from describe_splits method (CASSANDRA-1425)
[jira] [Resolved] (CASSANDRA-2874) commitlogs are not draining
[ https://issues.apache.org/jira/browse/CASSANDRA-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis resolved CASSANDRA-2874. --- Resolution: Duplicate should be fixed by CASSANDRA-2829 commitlogs are not draining --- Key: CASSANDRA-2874 URL: https://issues.apache.org/jira/browse/CASSANDRA-2874 Project: Cassandra Issue Type: Bug Components: Tools Affects Versions: 0.8.0 Environment: nodetool Reporter: Scott Dworkis Priority: Minor Attachments: CommitLog-1308192929288.log.header I have commitlogs on every node in the ring that will not drain after invoking a drain with nodetool. I'll attach one of the log headers at the request of Aaron Morton. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2889) Avoids having replicate on write tasks stacking up at CL.ONE
[ https://issues.apache.org/jira/browse/CASSANDRA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083261#comment-13083261 ] Sylvain Lebresne commented on CASSANDRA-2889: - bq. Is this the same as CASSANDRA-2892? No, CASSANDRA-2892 was really just saying if we have nothing to replicate to, let's not push a replication task that will do nothing anyway. It was really just a super easy optimization for the RF=1 case. This one is because at CL.ONE, we ack the client as soon as we have written the local mutation. But the replication involves a read. So if you write very quickly at CL.ONE, your read-to-replicate tasks may stack up because you're not able to do them fast enough. But maybe the best solution here would be to make CL.ONE wait for the read to have happened to ack the client. The current behavior makes for better latency at CL.ONE, but this is kind of a lie, because the hardest part of the work (the read) still happens in the background, and it is thus easy to flood the node. Avoids having replicate on write tasks stacking up at CL.ONE Key: CASSANDRA-2889 URL: https://issues.apache.org/jira/browse/CASSANDRA-2889 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.8.0 Reporter: Sylvain Lebresne Assignee: Sylvain Lebresne Labels: counters The counter design involves a read on the first replica during a write. At CL.ONE, this read is not involved in the latency of the operation (the write is acknowledged before). This means it is fairly easy to insert too quickly at CL.ONE and have the replicate on write tasks falling behind. The goal of this ticket is to protect against that. An option could be to bound the replicate on write task queue so that writes start to block once we have too many of those in the queue. Another option could be to drop the oldest tasks when they are too old, but it's probably a less safe option. -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira
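The bounded-queue option from the ticket can be sketched with java.util.concurrent alone; the class name and bound here are hypothetical, not Cassandra code. Giving the replicate-on-write executor a bounded queue with a caller-runs rejection policy means that once too many tasks are pending, the submitting write thread runs the task itself and the write path naturally slows down instead of letting tasks stack up without limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedReplicateOnWrite
{
    // Hypothetical bound; in practice this would be configurable.
    static final int MAX_PENDING = 4;

    static final AtomicInteger completed = new AtomicInteger();

    static final ThreadPoolExecutor REPLICATE_ON_WRITE = new ThreadPoolExecutor(
        1, 1, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<Runnable>(MAX_PENDING),
        // Queue full => the submitting (write) thread runs the task itself,
        // throttling writes instead of letting the backlog grow unbounded.
        new ThreadPoolExecutor.CallerRunsPolicy());

    public static void submitReplication()
    {
        REPLICATE_ON_WRITE.execute(new Runnable()
        {
            public void run()
            {
                completed.incrementAndGet(); // stand-in for the read + replicate
            }
        });
    }

    // Submit a burst of tasks, drain the executor, and report how many ran.
    public static int runDemo(int tasks)
    {
        for (int i = 0; i < tasks; i++)
            submitReplication();
        REPLICATE_ON_WRITE.shutdown();
        try
        {
            REPLICATE_ON_WRITE.awaitTermination(10, TimeUnit.SECONDS);
        }
        catch (InterruptedException e)
        {
            throw new AssertionError(e);
        }
        return completed.get();
    }
}
```

Every task still runs (some on the caller's thread), which matches the ticket's intent of blocking writers rather than dropping work; the alternative of dropping the oldest tasks would trade that guarantee for latency.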
[jira] [Assigned] (CASSANDRA-2806) Expose gossip/FD info to JMX
[ https://issues.apache.org/jira/browse/CASSANDRA-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis reassigned CASSANDRA-2806: - Assignee: Patricio Echague (was: Brandon Williams) Expose gossip/FD info to JMX Key: CASSANDRA-2806 URL: https://issues.apache.org/jira/browse/CASSANDRA-2806 Project: Cassandra Issue Type: Improvement Reporter: Brandon Williams Assignee: Patricio Echague Priority: Minor Fix For: 0.8.5 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (CASSANDRA-2961) Expire dead gossip states based on time
[ https://issues.apache.org/jira/browse/CASSANDRA-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis reassigned CASSANDRA-2961: - Assignee: Patricio Echague (was: Brandon Williams) Expire dead gossip states based on time --- Key: CASSANDRA-2961 URL: https://issues.apache.org/jira/browse/CASSANDRA-2961 Project: Cassandra Issue Type: Improvement Reporter: Brandon Williams Assignee: Patricio Echague Currently dead states are held until aVeryLongTime, 3 days. The problem is that if a node reboots within this period, it begins a new 3 days and will repopulate the ring with the dead state. While mostly harmless, perpetuating the state forever is at least wasting a small amount of bandwidth. Instead, we can expire states based on a ttl, which will require that the cluster be loosely time synced; within the quarantine period of 60s. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
svn commit: r1156703 - in /cassandra/site: publish/download/index.html publish/index.html src/settings.py
Author: slebresne Date: Thu Aug 11 17:50:28 2011 New Revision: 1156703 URL: http://svn.apache.org/viewvc?rev=1156703view=rev Log: Update web site for 0.8.4 Modified: cassandra/site/publish/download/index.html cassandra/site/publish/index.html cassandra/site/src/settings.py Modified: cassandra/site/publish/download/index.html URL: http://svn.apache.org/viewvc/cassandra/site/publish/download/index.html?rev=1156703r1=1156702r2=1156703view=diff == --- cassandra/site/publish/download/index.html (original) +++ cassandra/site/publish/download/index.html Thu Aug 11 17:50:28 2011 @@ -73,31 +73,31 @@ p - The latest stable release of Apache Cassandra is 0.8.3 - (released on 2011-08-08). iIf you're just + The latest stable release of Apache Cassandra is 0.8.4 + (released on 2011-08-11). iIf you're just starting out, download this one./i /p ul li a class=filename - href=http://www.apache.org/dyn/closer.cgi?path=/cassandra/0.8.3/apache-cassandra-0.8.3-bin.tar.gz; + href=http://www.apache.org/dyn/closer.cgi?path=/cassandra/0.8.4/apache-cassandra-0.8.4-bin.tar.gz; onclick=javascript: pageTracker._trackPageview('/clicks/binary_download'); - apache-cassandra-0.8.3-bin.tar.gz + apache-cassandra-0.8.4-bin.tar.gz /a -[a href=http://www.apache.org/dist/cassandra/0.8.3/apache-cassandra-0.8.3-bin.tar.gz.asc;PGP/a] -[a href=http://www.apache.org/dist/cassandra/0.8.3/apache-cassandra-0.8.3-bin.tar.gz.md5;MD5/a] -[a href=http://www.apache.org/dist/cassandra/0.8.3/apache-cassandra-0.8.3-bin.tar.gz.sha;SHA1/a] +[a href=http://www.apache.org/dist/cassandra/0.8.4/apache-cassandra-0.8.4-bin.tar.gz.asc;PGP/a] +[a href=http://www.apache.org/dist/cassandra/0.8.4/apache-cassandra-0.8.4-bin.tar.gz.md5;MD5/a] +[a href=http://www.apache.org/dist/cassandra/0.8.4/apache-cassandra-0.8.4-bin.tar.gz.sha;SHA1/a] /li li a class=filename - href=http://www.apache.org/dyn/closer.cgi?path=/cassandra/0.8.3/apache-cassandra-0.8.3-src.tar.gz; + 
href=http://www.apache.org/dyn/closer.cgi?path=/cassandra/0.8.4/apache-cassandra-0.8.4-src.tar.gz; onclick=javascript: pageTracker._trackPageview('/clicks/source_download'); - apache-cassandra-0.8.3-src.tar.gz + apache-cassandra-0.8.4-src.tar.gz /a -[a href=http://www.apache.org/dist/cassandra/0.8.3/apache-cassandra-0.8.3-src.tar.gz.asc;PGP/a] -[a href=http://www.apache.org/dist/cassandra/0.8.3/apache-cassandra-0.8.3-src.tar.gz.md5;MD5/a] -[a href=http://www.apache.org/dist/cassandra/0.8.3/apache-cassandra-0.8.3-src.tar.gz.sha;SHA1/a] +[a href=http://www.apache.org/dist/cassandra/0.8.4/apache-cassandra-0.8.4-src.tar.gz.asc;PGP/a] +[a href=http://www.apache.org/dist/cassandra/0.8.4/apache-cassandra-0.8.4-src.tar.gz.md5;MD5/a] +[a href=http://www.apache.org/dist/cassandra/0.8.4/apache-cassandra-0.8.4-src.tar.gz.sha;SHA1/a] /li /ul @@ -162,15 +162,15 @@ New users to Cassandra should be sure to h2Download/h2 div class=inner rc p -The latest release is b0.8.3/b -span class=relnotes(a href=https://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.8.3/CHANGES.txt;Changes/a)/span +The latest release is b0.8.4/b +span class=relnotes(a href=https://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.8.4/CHANGES.txt;Changes/a)/span /p p a class=filename - href=http://www.apache.org/dyn/closer.cgi?path=/cassandra/0.8.3/apache-cassandra-0.8.3-bin.tar.gz; + href=http://www.apache.org/dyn/closer.cgi?path=/cassandra/0.8.4/apache-cassandra-0.8.4-bin.tar.gz; onclick=javascript: pageTracker._trackPageview('/clicks/binary_download'); - apache-cassandra-0.8.3-bin.tar.gz + apache-cassandra-0.8.4-bin.tar.gz /a /p Modified: cassandra/site/publish/index.html URL: http://svn.apache.org/viewvc/cassandra/site/publish/index.html?rev=1156703r1=1156702r2=1156703view=diff == --- cassandra/site/publish/index.html (original) +++ cassandra/site/publish/index.html Thu Aug 11 17:50:28 2011 @@ -82,15 +82,15 @@ h2Download/h2 div class=inner rc p -The latest release is b0.8.3/b -span class=relnotes(a 
href=https://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.8.3/CHANGES.txt;Changes/a)/span +The latest release is b0.8.4/b +span class=relnotes(a href=https://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.8.4/CHANGES.txt;Changes/a)/span /p p a class=filename - href=http://www.apache.org/dyn/closer.cgi?path=/cassandra/0.8.3/apache-cassandra-0.8.3-bin.tar.gz; + href=http://www.apache.org/dyn/closer.cgi?path=/cassandra/0.8.4/apache-cassandra-0.8.4-bin.tar.gz; onclick=javascript:
[jira] [Commented] (CASSANDRA-2961) Expire dead gossip states based on time
[ https://issues.apache.org/jira/browse/CASSANDRA-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083281#comment-13083281 ] Brandon Williams commented on CASSANDRA-2961: - It looks like as a benefit of our ghetto string concatenation/delimiting in VersionedValue, we can just append timestamps to states that need to expire, then respect those in the gossiper (if present) instead of aVLT, but we still need to use aVLT if not. This way, everything will stay backward-compatible, since older nodes will just get extra 'pieces' in SS notifications and not use them. We probably only need to do this for the removed and left states, since the fat client removal logic is already solid and a timestamp won't really help there anyway. Expire dead gossip states based on time --- Key: CASSANDRA-2961 URL: https://issues.apache.org/jira/browse/CASSANDRA-2961 Project: Cassandra Issue Type: Improvement Reporter: Brandon Williams Assignee: Patricio Echague Currently dead states are held until aVeryLongTime, 3 days. The problem is that if a node reboots within this period, it begins a new 3 days and will repopulate the ring with the dead state. While mostly harmless, perpetuating the state forever is at least wasting a small amount of bandwidth. Instead, we can expire states based on a ttl, which will require that the cluster be loosely time synced; within the quarantine period of 60s. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
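The append-a-timestamp idea in Brandon's comment can be sketched as follows. The encoding and class are hypothetical (the real code lives in VersionedValue and the Gossiper); the point is only that a trailing delimited piece stays backward-compatible, because older nodes split the string and ignore pieces they don't expect.

```java
public class ExpiringState
{
    static final String DELIMITER = ",";

    // Hypothetical encoding "removed,<token>,<expiryMillis>": the expiry is
    // just one more delimited piece appended to the existing state string.
    static String encodeRemoved(String token, long expiryMillis)
    {
        return "removed" + DELIMITER + token + DELIMITER + expiryMillis;
    }

    // New-style values carry an expiry timestamp; old-style values don't,
    // and for those the caller falls back to the fixed aVeryLongTime window.
    static boolean isExpired(String value, long nowMillis)
    {
        String[] pieces = value.split(DELIMITER);
        return pieces.length >= 3 && nowMillis > Long.parseLong(pieces[2]);
    }
}
```

As the comment notes, this relies on the cluster being loosely time-synced, since the expiry timestamp is compared against each node's local clock.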
[jira] [Commented] (CASSANDRA-3001) Make the compression algorithm and chunk length configurable
[ https://issues.apache.org/jira/browse/CASSANDRA-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083291#comment-13083291 ] Pavel Yaskevich commented on CASSANDRA-3001: bq. Talking of the chunk length, its default value is 65535, which is 64k-1, not 64k. I think this is a problem because of the following line in CRAR.decompressChunk: If you take a look at CSW there is a constant CHUNK_LENGTH which is 64k (65536), where did you find 65535? Make the compression algorithm and chunk length configurable Key: CASSANDRA-3001 URL: https://issues.apache.org/jira/browse/CASSANDRA-3001 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Sylvain Lebresne Assignee: Sylvain Lebresne Priority: Minor Labels: compression Fix For: 1.0 Attachments: 0001-Pluggable-algorithm-and-chunk-length.patch, 0002-Add-deflate-compressor.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3001) Make the compression algorithm and chunk length configurable
[ https://issues.apache.org/jira/browse/CASSANDRA-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083304#comment-13083304 ] Sylvain Lebresne commented on CASSANDRA-3001: - Yes, I don't know why I said that. My brain has fucked up. Never mind. Make the compression algorithm and chunk length configurable Key: CASSANDRA-3001 URL: https://issues.apache.org/jira/browse/CASSANDRA-3001 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Sylvain Lebresne Assignee: Sylvain Lebresne Priority: Minor Labels: compression Fix For: 1.0 Attachments: 0001-Pluggable-algorithm-and-chunk-length.patch, 0002-Add-deflate-compressor.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
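For context on why an off-by-one chunk length would matter: compressed-file readers typically map a logical position to a chunk index plus an offset within the chunk, and a power-of-two length such as 65536 lets the modulo be computed as a bit mask. The sketch below is not the actual CRAR/CSW code, just the arithmetic being discussed.

```java
public class ChunkMath
{
    static final int CHUNK_LENGTH = 65536; // 64k exactly, as in CSW

    static long chunkIndex(long position)
    {
        return position / CHUNK_LENGTH;
    }

    static long offsetInChunk(long position)
    {
        // Valid only because CHUNK_LENGTH is a power of two; with 65535 any
        // code relying on this mask trick would silently compute the wrong
        // offset.
        return position & (CHUNK_LENGTH - 1);
    }
}
```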
[jira] [Updated] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patricio Echague updated CASSANDRA-2034: Attachment: CASSANDRA-2034-trunk-v11.patch v11 replaces v10. - Mark more variables that can be changed through JMX as volatile - Add CallbackInfo and CallbackInfoWithMessage and store the message in the same ExpiringMap. - Clean up some code style violations - Simplify the logic in SP.sendToHintedEndPoints a bit. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would make disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
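The scheduled-task idea in the ticket description can be sketched as follows. All names here are hypothetical stand-ins (the real change touches StorageProxy and the response handlers): after the client is acked at its ConsistencyLevel, a check scheduled at RpcTimeout writes a local hint for every replica that never responded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: ack the client once the ConsistencyLevel is met, then after
// RpcTimeout hint every replica that never acked the write.
public class HintOnTimeout
{
    static final ScheduledExecutorService SCHEDULER =
        Executors.newSingleThreadScheduledExecutor();

    // The check itself: every target that has not acked gets a hint.
    static List<String> hintTargets(Set<String> targets, Set<String> acked)
    {
        List<String> toHint = new ArrayList<String>();
        for (String replica : targets)
            if (!acked.contains(replica))
                toHint.add(replica); // stand-in for writing a local hint
        return toHint;
    }

    // Schedule the check at RpcTimeout, off the client's critical path.
    static void scheduleHintCheck(final Set<String> targets,
                                  final Set<String> acked,
                                  long rpcTimeoutMillis)
    {
        SCHEDULER.schedule(new Runnable()
        {
            public void run()
            {
                hintTargets(targets, acked); // hints would be written here
            }
        }, rpcTimeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

The design point is that the client's latency is unchanged; only the background check decides which targets need hints.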
svn commit: r1156749 - in /cassandra/branches/cassandra-0.8: CHANGES.txt src/java/org/apache/cassandra/net/MessagingService.java
Author: jbellis Date: Thu Aug 11 19:16:21 2011 New Revision: 1156749 URL: http://svn.apache.org/viewvc?rev=1156749view=rev Log: fix NPE when encryption_options is unspecified patch by jbellis; reviewed by brandonwilliams for CASSANDRA-3007 Modified: cassandra/branches/cassandra-0.8/CHANGES.txt cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/net/MessagingService.java Modified: cassandra/branches/cassandra-0.8/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/CHANGES.txt?rev=1156749r1=1156748r2=1156749view=diff == --- cassandra/branches/cassandra-0.8/CHANGES.txt (original) +++ cassandra/branches/cassandra-0.8/CHANGES.txt Thu Aug 11 19:16:21 2011 @@ -1,3 +1,7 @@ +0.8.5 + * fix NPE when encryption_options is unspecified (CASSANDRA-3007) + + 0.8.4 * include files-to-be-streamed in StreamInSession.getSources (CASSANDRA-2972) * use JAVA env var in cassandra-env.sh (CASSANDRA-2785, 2992) Modified: cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/net/MessagingService.java URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/net/MessagingService.java?rev=1156749r1=1156748r2=1156749view=diff == --- cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/net/MessagingService.java (original) +++ cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/net/MessagingService.java Thu Aug 11 19:16:21 2011 @@ -417,7 +417,8 @@ public final class MessagingService impl public void stream(StreamHeader header, InetAddress to) { /* Streaming asynchronously on streamExector_ threads. 
*/ -if (DatabaseDescriptor.getEncryptionOptions().internode_encryption == EncryptionOptions.InternodeEncryption.all) +EncryptionOptions encryption = DatabaseDescriptor.getEncryptionOptions(); +if (encryption != null && encryption.internode_encryption == EncryptionOptions.InternodeEncryption.all) streamExecutor_.execute(new SSLFileStreamTask(header, to)); else streamExecutor_.execute(new FileStreamTask(header, to));
[jira] [Updated] (CASSANDRA-2325) invalidateKeyCache / invalidateRowCache should remove saved cache files from disk
[ https://issues.apache.org/jira/browse/CASSANDRA-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edward Capriolo updated CASSANDRA-2325: --- Attachment: cassandra-2325-3.patch.txt 3rd revision of a 4 line patch ::embarrassed::. Thinking about this more, logging the message does not make sense. The file could be removed by the retention policy just as we are trying to remove it. That is not a condition we need to generate a log message on. invalidateKeyCache / invalidateRowCache should remove saved cache files from disk - Key: CASSANDRA-2325 URL: https://issues.apache.org/jira/browse/CASSANDRA-2325 Project: Cassandra Issue Type: Improvement Affects Versions: 0.6 Reporter: Matthew F. Dennis Assignee: Edward Capriolo Priority: Minor Fix For: 0.8.5 Attachments: cassandra-2325-1.patch.txt, cassandra-2325-3.patch.txt, cassandra-2325.patch.2.txt the invalidate[Key|Row]Cache calls don't remove the saved caches from disk. It seems logical that if you are clearing the caches you don't expect them to be reinstantiated with the old values the next time C* starts. This is not a huge issue since next time the caches are saved the old values will be removed. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
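The point in the comment above comes down to a property of File.delete(): it returns false rather than throwing when the file is already gone, so a racing removal by a retention policy needs no logging at all. A minimal sketch (the helper name is hypothetical, not the patch itself):

```java
import java.io.File;
import java.io.IOException;

public class SavedCacheInvalidation
{
    // Delete the saved cache file if present. delete() returning false
    // (e.g. the retention policy removed the file first) is expected here
    // and not worth a log message.
    static void deleteSavedCache(File savedCache)
    {
        savedCache.delete(); // result deliberately ignored
    }

    // Demonstrates that deleting an already-deleted file is harmless.
    static boolean demo()
    {
        try
        {
            File f = File.createTempFile("KeyCache", ".db");
            deleteSavedCache(f);
            deleteSavedCache(f); // already gone: no exception, no log spam
            return !f.exists();
        }
        catch (IOException e)
        {
            throw new AssertionError(e);
        }
    }
}
```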
[jira] [Commented] (CASSANDRA-2742) Add CQL to the python package index
[ https://issues.apache.org/jira/browse/CASSANDRA-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083410#comment-13083410 ] Eric Evans commented on CASSANDRA-2742: --- Is 1.0.4 a release version? I don't see it published at http://www.apache.org/dist/cassandra/drivers/py/ Add CQL to the python package index --- Key: CASSANDRA-2742 URL: https://issues.apache.org/jira/browse/CASSANDRA-2742 Project: Cassandra Issue Type: Improvement Components: API Affects Versions: 0.8.0 Reporter: Jonathan Ellis Assignee: paul cannon Priority: Minor Fix For: 0.8.5 Attachments: add_maintainer_info.patch.txt -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
svn commit: r1156753 - in /cassandra/branches/cassandra-0.8: CHANGES.txt src/java/org/apache/cassandra/thrift/CassandraServer.java src/java/org/apache/cassandra/thrift/ThriftValidation.java
Author: jbellis Date: Thu Aug 11 19:19:18 2011 New Revision: 1156753 URL: http://svn.apache.org/viewvc?rev=1156753&view=rev Log: include column name in validation failure exceptions patch by jbellis; reviewed by David Allsopp for CASSANDRA-2849 Modified: cassandra/branches/cassandra-0.8/CHANGES.txt cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/CassandraServer.java cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/ThriftValidation.java Modified: cassandra/branches/cassandra-0.8/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/CHANGES.txt?rev=1156753&r1=1156752&r2=1156753&view=diff == --- cassandra/branches/cassandra-0.8/CHANGES.txt (original) +++ cassandra/branches/cassandra-0.8/CHANGES.txt Thu Aug 11 19:19:18 2011 @@ -1,5 +1,6 @@ 0.8.5 * fix NPE when encryption_options is unspecified (CASSANDRA-3007) + * include column name in validation failure exceptions (CASSANDRA-2849) 0.8.4 Modified: cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/CassandraServer.java URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/CassandraServer.java?rev=1156753&r1=1156752&r2=1156753&view=diff == --- cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/CassandraServer.java (original) +++ cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/CassandraServer.java Thu Aug 11 19:19:18 2011 @@ -439,7 +439,7 @@ public class CassandraServer implements throw new InvalidRequestException("missing mandatory super column name for super CF " + column_parent.column_family); } ThriftValidation.validateColumnNames(metadata, column_parent, Arrays.asList(column.name)); -ThriftValidation.validateColumnData(metadata, column); +ThriftValidation.validateColumnData(metadata, column, column_parent.super_column != null); RowMutation rm = new RowMutation(state().getKeyspace(), key); try Modified: 
cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/ThriftValidation.java URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/ThriftValidation.java?rev=1156753r1=1156752r2=1156753view=diff == --- cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/ThriftValidation.java (original) +++ cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/ThriftValidation.java Thu Aug 11 19:19:18 2011 @@ -23,6 +23,9 @@ package org.apache.cassandra.thrift; import java.nio.ByteBuffer; import java.util.*; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + import org.apache.cassandra.config.*; import org.apache.cassandra.db.*; import org.apache.cassandra.db.marshal.AbstractType; @@ -51,6 +54,8 @@ import org.apache.cassandra.utils.FBUtil */ public class ThriftValidation { +private static Logger logger = LoggerFactory.getLogger(ThriftValidation.class); + public static void validateKey(CFMetaData metadata, ByteBuffer key) throws InvalidRequestException { if (key == null || key.remaining() == 0) @@ -285,7 +290,7 @@ public class ThriftValidation validateTtl(cosc.column); validateColumnPath(metadata, new ColumnPath(metadata.cfName).setSuper_column((ByteBuffer)null).setColumn(cosc.column.name)); -validateColumnData(metadata, cosc.column); +validateColumnData(metadata, cosc.column, false); } if (cosc.super_column != null) @@ -296,7 +301,7 @@ public class ThriftValidation for (Column c : cosc.super_column.columns) { validateColumnPath(metadata, new ColumnPath(metadata.cfName).setSuper_column(cosc.super_column.name).setColumn(c.name)); -validateColumnData(metadata, c); +validateColumnData(metadata, c, true); } } @@ -396,9 +401,9 @@ public class ThriftValidation } /** - * Validates the data part of the column (everything in the Column object but the name) + * Validates the data part of the column (everything in the Column object but the name, which is assumed to be valid) */ -public 
static void validateColumnData(CFMetaData metadata, Column column) throws InvalidRequestException +public static void validateColumnData(CFMetaData metadata, Column column, boolean isSubColumn) throws InvalidRequestException { validateTtl(column); if (!column.isSetValue()) @@ -413,15 +418,29 @@ public class ThriftValidation } catch (MarshalException me) { -throw new
svn commit: r1156757 - in /cassandra/trunk: ./ src/java/org/apache/cassandra/db/ src/java/org/apache/cassandra/db/commitlog/ test/unit/org/apache/cassandra/db/
Author: jbellis Date: Thu Aug 11 19:29:34 2011 New Revision: 1156757 URL: http://svn.apache.org/viewvc?rev=1156757&view=rev Log: make sure pre-truncate CL segments are discarded Modified: cassandra/trunk/CHANGES.txt cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java cassandra/trunk/src/java/org/apache/cassandra/db/SystemTable.java cassandra/trunk/src/java/org/apache/cassandra/db/commitlog/CommitLog.java cassandra/trunk/test/unit/org/apache/cassandra/db/RecoveryManagerTruncateTest.java Modified: cassandra/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/trunk/CHANGES.txt?rev=1156757&r1=1156756&r2=1156757&view=diff == --- cassandra/trunk/CHANGES.txt (original) +++ cassandra/trunk/CHANGES.txt Thu Aug 11 19:29:34 2011 @@ -232,7 +232,7 @@ * Disable compaction throttling during bootstrap (CASSANDRA-2612) * fix CQL treatment of and operators in range slices (CASSANDRA-2592) * fix potential double-application of counter updates on commitlog replay - (CASSANDRA-2419) + by moving replay position from header to sstable metadata (CASSANDRA-2419) * JDBC CQL driver exposes getColumn for access to timestamp * JDBC ResultSetMetadata properties added to AbstractType * r/m clustertool (CASSANDRA-2607) Modified: cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java URL: http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java?rev=1156757&r1=1156756&r2=1156757&view=diff == --- cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java (original) +++ cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java Thu Aug 11 19:29:34 2011 @@ -581,6 +581,7 @@ public class ColumnFamilyStore implement assert getMemtableThreadSafe() == oldMemtable; oldMemtable.freeze(); final ReplayPosition ctx = writeCommitLog ? CommitLog.instance.getContext() : ReplayPosition.NONE; +logger.debug("flush position is {}", ctx); // submit the memtable for any indexed sub-cfses, and our own. 
List<ColumnFamilyStore> icc = new ArrayList<ColumnFamilyStore>();
@@ -1532,6 +1533,37 @@ public class ColumnFamilyStore implement
 }
 
 /**
+ * Waits for flushes started BEFORE THIS METHOD IS CALLED to finish.
+ * Does NOT guarantee that no flush is active when it returns.
+ */
+private void waitForActiveFlushes()
+{
+    Future<?> future;
+    Table.switchLock.writeLock().lock();
+    try
+    {
+        future = postFlushExecutor.submit(new Runnable() { public void run() { } });
+    }
+    finally
+    {
+        Table.switchLock.writeLock().unlock();
+    }
+
+    try
+    {
+        future.get();
+    }
+    catch (InterruptedException e)
+    {
+        throw new AssertionError(e);
+    }
+    catch (ExecutionException e)
+    {
+        throw new AssertionError(e);
+    }
+}
+
+/**
 * Truncate practically deletes the entire column family's data
 * @return a Future to the delete operation. Call the future's get() to make
 * sure the column family has been deleted
@@ -1544,14 +1576,33 @@ public class ColumnFamilyStore implement
 // We accomplish this by first flushing manually, then snapshotting, and
 // recording the timestamp IN BETWEEN those actions. Any sstables created
 // with this timestamp or greater time, will not be marked for delete.
-try
-{
-    forceBlockingFlush();
-}
-catch (Exception e)
-{
-    throw new RuntimeException(e);
-}
+//
+// Bonus complication: since we store replay position in sstable metadata,
+// truncating those sstables means we will replay any CL segments from the
+// beginning if we restart before they are discarded for normal reasons
+// post-truncate. So we need to (a) force a new segment so the currently
+// active one can be discarded, and (b) flush *all* CFs so that unflushed
+// data in others don't keep any pre-truncate CL segments alive.
+//
+// Bonus bonus: simply forcing a flush of all the CFs is not enough, because if
+// for a given column family the memtable is clean, forceFlush will return
+// immediately, even though there could be a memtable being flushed at the same
+// time. So to guarantee that all segments can be cleaned out, we need
+// waitForActiveFlushes after the new segment has been created.
+CommitLog.instance.forceNewSegment();
+waitForActiveFlushes();
+List<Future<?>> futures = new ArrayList<Future<?>>();
+ReplayPosition position = CommitLog.instance.getContext();
+
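The waitForActiveFlushes trick in the commit above relies on a property of single-threaded executors: tasks run in submission order, so blocking on a no-op "barrier" task guarantees every task enqueued before it has finished. A standalone sketch of that idiom (FlushBarrier and the demo task are illustrative names, not Cassandra's classes):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the "flush barrier" idiom: submit a no-op task to a
// single-threaded executor and block on its Future; by the time get()
// returns, every task submitted before the barrier has completed.
public class FlushBarrier
{
    public static void awaitInFlight(ExecutorService postFlushExecutor)
    {
        // the no-op barrier task
        Future<?> barrier = postFlushExecutor.submit(new Runnable() { public void run() { } });
        try
        {
            barrier.get(); // returns only after all earlier tasks have drained
        }
        catch (InterruptedException e)
        {
            throw new AssertionError(e);
        }
        catch (ExecutionException e)
        {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args)
    {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        final StringBuilder order = new StringBuilder();
        exec.submit(new Runnable() { public void run() { order.append("flush;"); } });
        awaitInFlight(exec); // blocks until the "flush" task above is done
        order.append("barrier;");
        exec.shutdown();
        System.out.println(order); // flush;barrier;
    }
}
```

Note the guarantee only covers tasks submitted before the barrier, which is exactly why the commit message warns that a flush starting after the call may still be active when the method returns.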
svn commit: r1156758 - in /cassandra/trunk: ./ src/java/org/apache/cassandra/service/ src/java/org/apache/cassandra/utils/
Author: jbellis
Date: Thu Aug 11 19:29:38 2011
New Revision: 1156758
URL: http://svn.apache.org/viewvc?rev=1156758&view=rev
Log: provide monotonic read consistency
patch by jbellis; reviewed by slebresne for CASSANDRA-2494

Modified:
cassandra/trunk/CHANGES.txt
cassandra/trunk/src/java/org/apache/cassandra/service/RangeSliceResponseResolver.java
cassandra/trunk/src/java/org/apache/cassandra/service/RepairCallback.java
cassandra/trunk/src/java/org/apache/cassandra/service/RowRepairResolver.java
cassandra/trunk/src/java/org/apache/cassandra/service/StorageProxy.java
cassandra/trunk/src/java/org/apache/cassandra/utils/FBUtilities.java

Modified: cassandra/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/cassandra/trunk/CHANGES.txt?rev=1156758&r1=1156757&r2=1156758&view=diff
==============================================================================
--- cassandra/trunk/CHANGES.txt (original)
+++ cassandra/trunk/CHANGES.txt Thu Aug 11 19:29:38 2011
@@ -30,6 +30,9 @@
 * make column family backed column map pluggable and introduce unsynchronized
   ArrayList backed one to speedup reads (CASSANDRA-2843)
 * refactoring of the secondary index api (CASSANDRA-2982)
+ * make CL ONE reads wait for digest reconciliation before returning
+   (CASSANDRA-2494)
+
 0.8.4
 * include files-to-be-streamed in StreamInSession.getSources (CASSANDRA-2972)

Modified: cassandra/trunk/src/java/org/apache/cassandra/service/RangeSliceResponseResolver.java
URL: http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/service/RangeSliceResponseResolver.java?rev=1156758&r1=1156757&r2=1156758&view=diff
==============================================================================
--- cassandra/trunk/src/java/org/apache/cassandra/service/RangeSliceResponseResolver.java (original)
+++ cassandra/trunk/src/java/org/apache/cassandra/service/RangeSliceResponseResolver.java Thu Aug 11 19:29:38 2011
@@ -22,15 +22,18 @@
 import java.io.IOException;
 import java.net.InetAddress;
 import java.util.*;
 import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.TimeUnit;
 
 import com.google.common.collect.AbstractIterator;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import org.apache.cassandra.config.DatabaseDescriptor;
 import org.apache.cassandra.db.ColumnFamily;
 import org.apache.cassandra.db.DecoratedKey;
 import org.apache.cassandra.db.RangeSliceReply;
 import org.apache.cassandra.db.Row;
+import org.apache.cassandra.net.IAsyncResult;
 import org.apache.cassandra.net.Message;
 import org.apache.cassandra.utils.Pair;
 import org.apache.cassandra.utils.CloseableIterator;
@@ -43,9 +46,19 @@ import org.apache.cassandra.utils.MergeI
 public class RangeSliceResponseResolver implements IResponseResolver<Iterable<Row>>
 {
 private static final Logger logger_ = LoggerFactory.getLogger(RangeSliceResponseResolver.class);
+
+private static final Comparator<Pair<Row, InetAddress>> pairComparator = new Comparator<Pair<Row, InetAddress>>()
+{
+    public int compare(Pair<Row, InetAddress> o1, Pair<Row, InetAddress> o2)
+    {
+        return o1.left.key.compareTo(o2.left.key);
+    }
+};
+
 private final String table;
 private final List<InetAddress> sources;
 protected final Collection<Message> responses = new LinkedBlockingQueue<Message>();
+public final List<IAsyncResult> repairResults = new ArrayList<IAsyncResult>();
 
 public RangeSliceResponseResolver(String table, List<InetAddress> sources)
 {
@@ -73,50 +86,7 @@ public class RangeSliceResponseResolver
 iters.add(new RowIterator(reply.rows.iterator(), response.getFrom()));
 }
 // for each row, compute the combination of all different versions seen, and repair incomplete versions
-MergeIterator<Pair<Row,InetAddress>, Row> iter = MergeIterator.get(iters, new Comparator<Pair<Row,InetAddress>>()
-{
-    public int compare(Pair<Row,InetAddress> o1, Pair<Row,InetAddress> o2)
-    {
-        return o1.left.key.compareTo(o2.left.key);
-    }
-}, new MergeIterator.Reducer<Pair<Row,InetAddress>, Row>()
-{
-    List<ColumnFamily> versions = new ArrayList<ColumnFamily>(sources.size());
-    List<InetAddress> versionSources = new ArrayList<InetAddress>(sources.size());
-    DecoratedKey key;
-
-    public void reduce(Pair<Row,InetAddress> current)
-    {
-        key = current.left.key;
-        versions.add(current.left.cf);
-        versionSources.add(current.right);
-    }
-
-    protected Row getReduced()
-    {
-        ColumnFamily resolved = versions.size() > 1
-                              ? RowRepairResolver.resolveSuperset(versions)
-                              : versions.get(0);
-        if (versions.size() < sources.size())
-        {
-            // add placeholder rows for
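The MergeIterator/Reducer pattern in the diff above performs a k-way merge over sorted per-replica row iterators, grouping equal-keyed versions so that replicas holding incomplete versions can be repaired. A simplified standalone sketch, with String keys and a version count standing in for Cassandra's Row/ColumnFamily types:

```java
import java.util.*;

// Simplified sketch of the resolver's k-way merge: each replica returns
// rows sorted by key; we repeatedly take the smallest head key across all
// replicas and count how many replicas returned it. A count below the
// replica count is the "incomplete versions need repair" case.
public class RangeMerge
{
    static final class Cursor
    {
        final Iterator<String> it;
        String head;
        Cursor(List<String> rows) { it = rows.iterator(); head = it.hasNext() ? it.next() : null; }
        void advance() { head = it.hasNext() ? it.next() : null; }
    }

    public static LinkedHashMap<String, Integer> merge(List<List<String>> replicas)
    {
        List<Cursor> cursors = new ArrayList<Cursor>();
        for (List<String> r : replicas)
            cursors.add(new Cursor(r));

        LinkedHashMap<String, Integer> merged = new LinkedHashMap<String, Integer>();
        while (true)
        {
            // find the smallest key among all replica heads
            String min = null;
            for (Cursor c : cursors)
                if (c.head != null && (min == null || c.head.compareTo(min) < 0))
                    min = c.head;
            if (min == null)
                break; // all replicas exhausted

            // "reduce" step: collect every replica's version of that key
            int versions = 0;
            for (Cursor c : cursors)
                if (min.equals(c.head)) { versions++; c.advance(); }
            merged.put(min, versions); // versions < replicas.size() => repair needed
        }
        return merged;
    }

    public static void main(String[] args)
    {
        System.out.println(merge(Arrays.asList(Arrays.asList("a", "b", "c"),
                                               Arrays.asList("b", "c", "d"))));
        // {a=1, b=2, c=2, d=1}
    }
}
```

The real resolver additionally reconciles the collected ColumnFamily versions (RowRepairResolver.resolveSuperset) and sends repair mutations to the lagging replicas; the merge skeleton is the same.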
[jira] [Commented] (CASSANDRA-2967) Only bind JMX to the same IP address that is being used in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083430#comment-13083430 ]

Jackson Chung commented on CASSANDRA-2967:
------------------------------------------

i wouldn't say it is lhf, this is quite important in terms of network security. not adding value here, just a ref site similar to the above blog (it's just from the official Sun page): http://download.oracle.com/javase/6/docs/technotes/guides/management/agent.html
see the "Monitoring Applications through a Firewall" section if tl;dr

Only bind JMX to the same IP address that is being used in Cassandra

Key: CASSANDRA-2967
URL: https://issues.apache.org/jira/browse/CASSANDRA-2967
Project: Cassandra
Issue Type: Bug
Components: Tools
Affects Versions: 0.8.2
Reporter: Joaquin Casares
Priority: Minor
Labels: lhf

The setup is 5 nodes in each data center, all running on one physical test machine; even though the repair was run against the correct IP, the wrong JMX port was used. As a result, instead of repairing all 5 nodes I was repairing the same node 5 times. It would be nice if Cassandra's JMX would bind only to the IP address on which its thrift/RPC services are listening instead of binding to all IPs on the box.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
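For reference, the direction the ticket asks for can be sketched by creating the JMX connector server explicitly instead of relying on the com.sun.management.jmxremote agent (which binds on all interfaces). This sketch mainly pins the advertised RMI hostname and the registry port; fully restricting the bind address would additionally require a custom RMIServerSocketFactory. The host and port below are examples, not Cassandra defaults:

```java
import java.lang.management.ManagementFactory;
import java.rmi.registry.LocateRegistry;
import java.util.HashMap;
import javax.management.MBeanServer;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: start a JMX connector server against a chosen address/port
// rather than letting the default agent pick a random RMI port on 0.0.0.0.
public class BoundJmx
{
    public static JMXConnectorServer start(String host, int port) throws Exception
    {
        // address that remote clients are told to connect back to
        System.setProperty("java.rmi.server.hostname", host);
        LocateRegistry.createRegistry(port); // fixed registry port, firewall-friendly
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi://" + host + "/jndi/rmi://" + host + ":" + port + "/jmxrmi");
        JMXConnectorServer server =
            JMXConnectorServerFactory.newJMXConnectorServer(url, new HashMap<String, Object>(), mbs);
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception
    {
        JMXConnectorServer s = start("127.0.0.1", 18080);
        System.out.println(s.isActive()); // true
        s.stop();
    }
}
```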
svn commit: r1156772 - in /cassandra/branches/cassandra-0.8: CHANGES.txt src/java/org/apache/cassandra/db/ColumnFamilyStore.java
Author: jbellis
Date: Thu Aug 11 19:43:13 2011
New Revision: 1156772
URL: http://svn.apache.org/viewvc?rev=1156772&view=rev
Log: cache invalidate removes saved cache files
patch by Ed Capriolo; reviewed by jbellis for CASSANDRA-2325

Modified:
cassandra/branches/cassandra-0.8/CHANGES.txt
cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java

Modified: cassandra/branches/cassandra-0.8/CHANGES.txt
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/CHANGES.txt?rev=1156772&r1=1156771&r2=1156772&view=diff
==============================================================================
--- cassandra/branches/cassandra-0.8/CHANGES.txt (original)
+++ cassandra/branches/cassandra-0.8/CHANGES.txt Thu Aug 11 19:43:13 2011
@@ -3,6 +3,7 @@
 * include column name in validation failure exceptions (CASSANDRA-2849)
 * make sure truncate clears out the commitlog so replay won't re-
   populate with truncated data (CASSANDRA-2950)
+ * cache invalidate removes saved cache files (CASSANDRA-2325)
 
 0.8.4

Modified: cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java?rev=1156772&r1=1156771&r2=1156772&view=diff
==============================================================================
--- cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java (original)
+++ cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java Thu Aug 11 19:43:13 2011
@@ -1767,11 +1767,15 @@ public class ColumnFamilyStore implement
 public void invalidateRowCache()
 {
 rowCache.clear();
+if (rowCache.getCachePath().exists())
+    rowCache.getCachePath().delete();
 }
 
 public void invalidateKeyCache()
 {
 keyCache.clear();
+if (keyCache.getCachePath().exists())
+    keyCache.getCachePath().delete();
 }
 
 public int getRowCacheCapacity()
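The invalidate-and-delete pattern from the patch above, sketched standalone: clear the in-memory cache and remove its saved-cache file so a restart cannot repopulate it with stale entries. The cache map and file handling here are illustrative, not Cassandra's actual saved-cache layout:

```java
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Sketch of CASSANDRA-2325's behavior: invalidate() empties the cache AND
// deletes the on-disk saved copy, mirroring the exists()/delete() pair in
// the patch. Deletion is best-effort; failure is treated as non-fatal.
public class SavedCache
{
    private final Map<String, String> cache = new HashMap<String, String>();
    private final File savedPath;

    public SavedCache(File savedPath) { this.savedPath = savedPath; }

    public void put(String key, String value) { cache.put(key, value); }

    public int size() { return cache.size(); }

    public void invalidate()
    {
        cache.clear();
        if (savedPath.exists())
            savedPath.delete(); // best-effort: a failed delete is ignored, like the patch
    }

    public static void main(String[] args) throws IOException
    {
        File f = File.createTempFile("saved-cache", ".db"); // stand-in for the saved-cache file
        SavedCache c = new SavedCache(f);
        c.put("k", "v");
        c.invalidate();
        System.out.println(c.size() + " " + f.exists()); // 0 false
    }
}
```

Note that the revert further down this digest (r1156791) rolled this change back because it broke KeyCacheTest, so treat the pattern, not the shipped behavior, as the takeaway.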
svn commit: r1156775 - in /cassandra/trunk: ./ contrib/ doc/cql/ interface/thrift/gen-java/org/apache/cassandra/thrift/ src/java/org/apache/cassandra/db/ src/java/org/apache/cassandra/net/ src/java/or
Author: jbellis Date: Thu Aug 11 19:49:24 2011 New Revision: 1156775 URL: http://svn.apache.org/viewvc?rev=1156775view=rev Log: merge from 0.8 Modified: cassandra/trunk/ (props changed) cassandra/trunk/CHANGES.txt cassandra/trunk/contrib/ (props changed) cassandra/trunk/doc/cql/CQL.textile cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/InvalidRequestException.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/NotFoundException.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/SuperColumn.java (props changed) cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java cassandra/trunk/src/java/org/apache/cassandra/net/MessagingService.java cassandra/trunk/src/java/org/apache/cassandra/thrift/CassandraServer.java cassandra/trunk/src/java/org/apache/cassandra/thrift/ThriftValidation.java Propchange: cassandra/trunk/ -- --- svn:mergeinfo (original) +++ svn:mergeinfo Thu Aug 11 19:49:24 2011 @@ -1,7 +1,7 @@ /cassandra/branches/cassandra-0.6:922689-1052356,1052358-1053452,1053454,1053456-1131291 /cassandra/branches/cassandra-0.7:1026516-1151306 /cassandra/branches/cassandra-0.7.0:1053690-1055654 -/cassandra/branches/cassandra-0.8:1090934-1125013,1125019-1156264 +/cassandra/branches/cassandra-0.8:1090934-1125013,1125019-1156772 /cassandra/branches/cassandra-0.8.0:1125021-1130369 /cassandra/branches/cassandra-0.8.1:1101014-1125018 /cassandra/tags/cassandra-0.7.0-rc3:1051699-1053689 Modified: cassandra/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/trunk/CHANGES.txt?rev=1156775r1=1156774r2=1156775view=diff == --- cassandra/trunk/CHANGES.txt (original) +++ cassandra/trunk/CHANGES.txt Thu Aug 11 19:49:24 2011 @@ -34,15 +34,23 @@ (CASSANDRA-2494) +0.8.5 
+ * fix NPE when encryption_options is unspecified (CASSANDRA-3007) + * include column name in validation failure exceptions (CASSANDRA-2849) + * make sure truncate clears out the commitlog so replay won't re- + populate with truncated data (CASSANDRA-2950) + * cache invalidate removes saved cache files (CASSANDRA-2325) + + 0.8.4 * include files-to-be-streamed in StreamInSession.getSources (CASSANDRA-2972) * use JAVA env var in cassandra-env.sh (CASSANDRA-2785, 2992) * avoid doing read for no-op replicate-on-write at CL=1 (CASSANDRA-2892) * refuse counter write for CL.ANY (CASSANDRA-2990) * switch back to only logging recent dropped messages (CASSANDRA-3004) - * fix issues with parameters being escaped incorrectly in Python CQL (CASSANDRA-2993) * always deserialize RowMutation for counters (CASSANDRA-3006) * ignore saved replication_factor strategy_option for NTS (CASSANDRA-3011) + * make sure pre-truncate CL segments are discarded (CASSANDRA-2950) 0.8.3 @@ -64,8 +72,8 @@ (CASSANDRA-2958) * improved POSIX compatibility of start scripts (CASsANDRA-2965) * add counter support to Hadoop InputFormat (CASSANDRA-2981) - * fix bug where dirty commit logs were removed (and avoid keeping segments - with no post-flush activity permanently dirty) (CASSANDRA-2829) + * fix bug where dirty commitlog segments were removed (and avoid keeping + segments with no post-flush activity permanently dirty) (CASSANDRA-2829) * fix throwing exception with batch mutation of counter super columns (CASSANDRA-2949) * ignore system tables during repair (CASSANDRA-2979) @@ -807,7 +815,7 @@ transmitting separately (CASSANDRA-1465) * apply reversed flag during collation from different data sources (CASSANDRA-1450) - * make failure to remove comitlog segment non-fatal (CASSANDRA-1348) + * make failure to remove commitlog segment non-fatal (CASSANDRA-1348) * correct ordering of drain operations so CL.recover is no longer necessary (CASSANDRA-1408) * removed keyspace from describe_splits method 
(CASSANDRA-1425) Propchange: cassandra/trunk/contrib/ -- --- svn:mergeinfo (original) +++ svn:mergeinfo Thu Aug 11 19:49:24 2011 @@ -1,7 +1,7 @@ /cassandra/branches/cassandra-0.6/contrib:922689-1052356,1052358-1053452,1053454,1053456-1068009 /cassandra/branches/cassandra-0.7/contrib:1026516-1151306 /cassandra/branches/cassandra-0.7.0/contrib:1053690-1055654 -/cassandra/branches/cassandra-0.8/contrib:1090934-1125013,1125019-1156264 +/cassandra/branches/cassandra-0.8/contrib:1090934-1125013,1125019-1156772 /cassandra/branches/cassandra-0.8.0/contrib:1125021-1130369
[jira] [Commented] (CASSANDRA-2967) Only bind JMX to the same IP address that is being used in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083438#comment-13083438 ] Jonathan Ellis commented on CASSANDRA-2967: --- (lhf just means it's relatively straightforward, not that it's not valuable.) Only bind JMX to the same IP address that is being used in Cassandra Key: CASSANDRA-2967 URL: https://issues.apache.org/jira/browse/CASSANDRA-2967 Project: Cassandra Issue Type: Bug Components: Tools Affects Versions: 0.8.2 Reporter: Joaquin Casares Priority: Minor Labels: lhf The setup is 5 nodes in each data center are all running on one physical test machine and even though the repair was run against the correct IP the wrong JMX port was used. As a result, instead of repairing all 5 nodes I was repairing the same node 5 times. It would be nice if Cassandra's JMX would bind to only the IP address on which its thrift/RPC services are listening on instead of binding to all IP's on the box. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
buildbot failure in ASF Buildbot on cassandra-trunk
The Buildbot has detected a new failure on builder cassandra-trunk while building ASF Buildbot. Full details are available at: http://ci.apache.org/builders/cassandra-trunk/builds/1511 Buildbot URL: http://ci.apache.org/ Buildslave for this Build: isis_ubuntu Build Reason: scheduler Build Source Stamp: [branch cassandra/trunk] 1156775 Blamelist: jbellis BUILD FAILED: failed compile sincerely, -The Buildbot
svn commit: r1156791 - in /cassandra/branches/cassandra-0.8: CHANGES.txt src/java/org/apache/cassandra/db/ColumnFamilyStore.java
Author: jbellis
Date: Thu Aug 11 20:05:38 2011
New Revision: 1156791
URL: http://svn.apache.org/viewvc?rev=1156791&view=rev
Log: revert r1156772

Modified:
cassandra/branches/cassandra-0.8/CHANGES.txt
cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java

Modified: cassandra/branches/cassandra-0.8/CHANGES.txt
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/CHANGES.txt?rev=1156791&r1=1156790&r2=1156791&view=diff
==============================================================================
--- cassandra/branches/cassandra-0.8/CHANGES.txt (original)
+++ cassandra/branches/cassandra-0.8/CHANGES.txt Thu Aug 11 20:05:38 2011
@@ -3,7 +3,6 @@
 * include column name in validation failure exceptions (CASSANDRA-2849)
 * make sure truncate clears out the commitlog so replay won't re-
   populate with truncated data (CASSANDRA-2950)
- * cache invalidate removes saved cache files (CASSANDRA-2325)
 
 0.8.4

Modified: cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java?rev=1156791&r1=1156790&r2=1156791&view=diff
==============================================================================
--- cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java (original)
+++ cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java Thu Aug 11 20:05:38 2011
@@ -1767,15 +1767,11 @@ public class ColumnFamilyStore implement
 public void invalidateRowCache()
 {
 rowCache.clear();
-if (rowCache.getCachePath().exists())
-    rowCache.getCachePath().delete();
 }
 
 public void invalidateKeyCache()
 {
 keyCache.clear();
-if (keyCache.getCachePath().exists())
-    keyCache.getCachePath().delete();
 }
 
 public int getRowCacheCapacity()
svn commit: r1156792 - in /cassandra/trunk: ./ contrib/ interface/thrift/gen-java/org/apache/cassandra/thrift/ src/java/org/apache/cassandra/db/
Author: jbellis Date: Thu Aug 11 20:05:59 2011 New Revision: 1156792 URL: http://svn.apache.org/viewvc?rev=1156792view=rev Log: merge from 0.8 Modified: cassandra/trunk/ (props changed) cassandra/trunk/CHANGES.txt cassandra/trunk/contrib/ (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/InvalidRequestException.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/NotFoundException.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/SuperColumn.java (props changed) cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java Propchange: cassandra/trunk/ -- --- svn:mergeinfo (original) +++ svn:mergeinfo Thu Aug 11 20:05:59 2011 @@ -1,7 +1,7 @@ /cassandra/branches/cassandra-0.6:922689-1052356,1052358-1053452,1053454,1053456-1131291 /cassandra/branches/cassandra-0.7:1026516-1151306 /cassandra/branches/cassandra-0.7.0:1053690-1055654 -/cassandra/branches/cassandra-0.8:1090934-1125013,1125019-1156772 +/cassandra/branches/cassandra-0.8:1090934-1125013,1125019-1156791 /cassandra/branches/cassandra-0.8.0:1125021-1130369 /cassandra/branches/cassandra-0.8.1:1101014-1125018 /cassandra/tags/cassandra-0.7.0-rc3:1051699-1053689 Modified: cassandra/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/trunk/CHANGES.txt?rev=1156792r1=1156791r2=1156792view=diff == --- cassandra/trunk/CHANGES.txt (original) +++ cassandra/trunk/CHANGES.txt Thu Aug 11 20:05:59 2011 @@ -39,7 +39,6 @@ * include column name in validation failure exceptions (CASSANDRA-2849) * make sure truncate clears out the commitlog so replay won't re- populate with truncated data (CASSANDRA-2950) - * cache invalidate removes saved cache files (CASSANDRA-2325) 0.8.4 Propchange: 
cassandra/trunk/contrib/ -- --- svn:mergeinfo (original) +++ svn:mergeinfo Thu Aug 11 20:05:59 2011 @@ -1,7 +1,7 @@ /cassandra/branches/cassandra-0.6/contrib:922689-1052356,1052358-1053452,1053454,1053456-1068009 /cassandra/branches/cassandra-0.7/contrib:1026516-1151306 /cassandra/branches/cassandra-0.7.0/contrib:1053690-1055654 -/cassandra/branches/cassandra-0.8/contrib:1090934-1125013,1125019-1156772 +/cassandra/branches/cassandra-0.8/contrib:1090934-1125013,1125019-1156791 /cassandra/branches/cassandra-0.8.0/contrib:1125021-1130369 /cassandra/branches/cassandra-0.8.1/contrib:1101014-1125018 /cassandra/tags/cassandra-0.7.0-rc3/contrib:1051699-1053689 Propchange: cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java -- --- svn:mergeinfo (original) +++ svn:mergeinfo Thu Aug 11 20:05:59 2011 @@ -1,7 +1,7 @@ /cassandra/branches/cassandra-0.6/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:922689-1052356,1052358-1053452,1053454,1053456-1131291 /cassandra/branches/cassandra-0.7/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1026516-1151306 /cassandra/branches/cassandra-0.7.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1053690-1055654 -/cassandra/branches/cassandra-0.8/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1090934-1125013,1125019-1156772 +/cassandra/branches/cassandra-0.8/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1090934-1125013,1125019-1156791 /cassandra/branches/cassandra-0.8.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1125021-1130369 /cassandra/branches/cassandra-0.8.1/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1101014-1125018 /cassandra/tags/cassandra-0.7.0-rc3/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1051699-1053689 Propchange: cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java -- --- svn:mergeinfo 
(original) +++ svn:mergeinfo Thu Aug 11 20:05:59 2011 @@ -1,7 +1,7 @@ /cassandra/branches/cassandra-0.6/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java:922689-1052356,1052358-1053452,1053454,1053456-1131291 /cassandra/branches/cassandra-0.7/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java:1026516-1151306 /cassandra/branches/cassandra-0.7.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java:1053690-1055654
[jira] [Reopened] (CASSANDRA-2325) invalidateKeyCache / invalidateRowCache should remove saved cache files from disk
[ https://issues.apache.org/jira/browse/CASSANDRA-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis reopened CASSANDRA-2325: --- reopening because it breaks KeyCacheTest testing is about the only use of invalidate() i can think of anyway, since we keep the cache consistent w/ the data. should we notaproblem this? invalidateKeyCache / invalidateRowCache should remove saved cache files from disk - Key: CASSANDRA-2325 URL: https://issues.apache.org/jira/browse/CASSANDRA-2325 Project: Cassandra Issue Type: Improvement Affects Versions: 0.6 Reporter: Matthew F. Dennis Assignee: Edward Capriolo Priority: Minor Fix For: 0.8.5 Attachments: cassandra-2325-1.patch.txt, cassandra-2325-3.patch.txt, cassandra-2325.patch.2.txt the invalidate[Key|Row]Cache calls don't remove the saved caches from disk. It seems logical that if you are clearing the caches you don't expect them to be reinstantiated with the old values the next time C* starts. This is not a huge issue since next time the caches are saved the old values will be removed. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2845) Cassandra uses 100% system CPU on Ubuntu Natty (11.04)
[ https://issues.apache.org/jira/browse/CASSANDRA-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083455#comment-13083455 ] Jackson Chung commented on CASSANDRA-2845: -- fwiw, i was able to avoid this (hang) if using just java (Sun's) instead of jsvc. (jna enabled on both, i do have to symlink it manually when start manually if install from deb package) Once i switch to jsvc, hell breaks kernel was on the older one on ec2: Linux domU-12-31-39-00-2C-42 2.6.38-8-virtual #42-Ubuntu SMP Mon Apr 11 04:06:34 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux Also able to had the same hang on a 2.6.35 on a rackspace couple days ago (killed the vm already..) dmesg shows timeout/OOM, crazy stuff :) on my own local,with separate install kernel Distributor ID: Ubuntu Description:Ubuntu 10.10 Release:10.10 Codename: maverick Linux faranth 2.6.39-02063903-generic #201107091121 SMP Sat Jul 9 11:25:36 UTC 2011 x86_64 GNU/Linux I don't have the hang problem (using jsvc/jna/package) Cassandra uses 100% system CPU on Ubuntu Natty (11.04) -- Key: CASSANDRA-2845 URL: https://issues.apache.org/jira/browse/CASSANDRA-2845 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 0.8.0 Environment: Default install of Ubuntu 11.04 Reporter: Steve Corona Assignee: paul cannon Priority: Critical Fix For: 0.8.2 Step 1. Boot up a brand new, default Ubuntu 11.04 Server install Step 2. Install Cassandra from Apache APT Respository (deb http://www.apache.org/dist/cassandra/debian 08x main) Step 3. apt-get install cassandra, as soon as it cassandra starts it will freeze the machine What's happening is that as soon as cassandra starts up it immediately sucks up 100% of CPU and starves the machine. This effectively bricks the box until you boot into single user mode and disable the cassandra init.d script. Under htop, the CPU usage shows up as system cpu, not user. 
The machine I'm testing this on is a Quad-Core Sandy Bridge w/ 16GB of Memory, so it's not a system resource issue. I've also tested this on completely different hardware (Dual 64-Bit Xeons AMD X4) and it has the same effect. Ubuntu 10.10 does not exhibit the same issue. I have only tested 0.8 and 0.8.1. root@cassandra01:/# java -version java version 1.6.0_22 OpenJDK Runtime Environment (IcedTea6 1.10.2) (6b22-1.10.2-0ubuntu1~11.04.1) OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode) root@cassandra:/# uname -a Linux cassandra01 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux /proc/cpu Intel(R) Xeon(R) CPU E31270 @ 3.40GHz /proc/meminfo MemTotal: 16459776 kB MemFree:14190708 kB -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
buildbot success in ASF Buildbot on cassandra-trunk
The Buildbot has detected a restored build on builder cassandra-trunk while building ASF Buildbot. Full details are available at: http://ci.apache.org/builders/cassandra-trunk/builds/1512 Buildbot URL: http://ci.apache.org/ Buildslave for this Build: isis_ubuntu Build Reason: scheduler Build Source Stamp: [branch cassandra/trunk] 1156792 Blamelist: jbellis Build succeeded! sincerely, -The Buildbot
[jira] [Commented] (CASSANDRA-2494) Quorum reads are not monotonically consistent
[ https://issues.apache.org/jira/browse/CASSANDRA-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083465#comment-13083465 ] Hudson commented on CASSANDRA-2494: --- Integrated in Cassandra #1017 (See [https://builds.apache.org/job/Cassandra/1017/]) provide monotonic read consistency patch by jbellis; reviewed by slebresne for CASSANDRA-2494 jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1156758 Files : * /cassandra/trunk/CHANGES.txt * /cassandra/trunk/src/java/org/apache/cassandra/service/StorageProxy.java * /cassandra/trunk/src/java/org/apache/cassandra/service/RepairCallback.java * /cassandra/trunk/src/java/org/apache/cassandra/service/RowRepairResolver.java * /cassandra/trunk/src/java/org/apache/cassandra/utils/FBUtilities.java * /cassandra/trunk/src/java/org/apache/cassandra/service/RangeSliceResponseResolver.java Quorum reads are not monotonically consistent - Key: CASSANDRA-2494 URL: https://issues.apache.org/jira/browse/CASSANDRA-2494 Project: Cassandra Issue Type: Bug Components: Core Reporter: Sean Bridges Assignee: Jonathan Ellis Priority: Minor Fix For: 1.0 Attachments: 2494-v2.txt, 2494.txt As discussed in this thread, http://www.mail-archive.com/user@cassandra.apache.org/msg12421.html Quorum reads should be consistent. Assume we have a cluster of 3 nodes (X,Y,Z) and a replication factor of 3. If a write of N is committed to X, but not Y and Z, then a read from X should not return N unless the read is committed to at least two nodes. To ensure this, a read from X should wait for an ack of the read repair write from either Y or Z before returning. Are there system tests for cassandra? If so, there should be a test similar to the original post in the email thread. One thread should write 1,2,3... at consistency level ONE. Another thread should read at consistency level QUORUM from a random host, and verify that each read is = the last read. 
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
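The system test proposed at the end of the ticket reduces to a monotonicity check over the sequence of values a reader observes: every QUORUM read must return a value greater than or equal to the previous read. A sketch of that check (plain lists stand in for actual reads against a cluster):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the ticket's proposed verification: one thread writes
// 1, 2, 3, ... at CL.ONE; a reader records what each QUORUM read returns;
// the recorded sequence must never decrease.
public class MonotonicCheck
{
    public static boolean isMonotonic(List<Integer> observed)
    {
        for (int i = 1; i < observed.size(); i++)
            if (observed.get(i) < observed.get(i - 1))
                return false; // a read went "backwards": not monotonically consistent
        return true;
    }

    public static void main(String[] args)
    {
        System.out.println(isMonotonic(Arrays.asList(1, 2, 2, 3))); // true
        System.out.println(isMonotonic(Arrays.asList(1, 3, 2)));    // false
    }
}
```

The fix in r1156758 makes this property hold by having the read wait for the read-repair write to be acked by enough replicas before returning.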
[jira] [Commented] (CASSANDRA-2849) InvalidRequestException when validating column data includes entire column value
[ https://issues.apache.org/jira/browse/CASSANDRA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083480#comment-13083480 ] Hudson commented on CASSANDRA-2849: --- Integrated in Cassandra-0.8 #272 (See [https://builds.apache.org/job/Cassandra-0.8/272/]) include column name in validation failure exceptions patch by jbellis; reviewed by David Allsopp for CASSANDRA-2849 jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1156753 Files : * /cassandra/branches/cassandra-0.8/CHANGES.txt * /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/ThriftValidation.java * /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/thrift/CassandraServer.java InvalidRequestException when validating column data includes entire column value Key: CASSANDRA-2849 URL: https://issues.apache.org/jira/browse/CASSANDRA-2849 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 0.8.1 Reporter: David Allsopp Assignee: David Allsopp Priority: Minor Fix For: 0.8.5 Attachments: 2849-v2.txt, cassandra-2849.diff If the column value fails to validate, then ThriftValidation.validateColumnData() calls bytesToHex() on the entire column value and puts this string in the Exception. Since the value may be up to 2GB, this is potentially a lot of extra memory. The value is likely to be logged (and presumably returned to the thrift client over the network?). This could cause a lot of slowdown or an unnecessary OOM crash, and is unlikely to be useful (the client has access to the full value anyway if required for debugging). Also, the reason for the exception (extracted from the MarshalException) is printed _after_ the data, so if there's any truncation in the logging system at any point, the reason will be lost. The reason should be displayed before the column value, and the column value should be truncated in the Exception message. -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira
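The fix direction described in the ticket (put the MarshalException reason first, truncate the hex dump of the column value) can be sketched as follows; the byte cap and message layout are illustrative, not what the attached patch actually uses:

```java
// Sketch for CASSANDRA-2849: build a validation-failure message that leads
// with the reason and caps the hex dump, so a multi-gigabyte column value
// cannot balloon the exception (or get the reason truncated away by logging).
public class TruncatedHex
{
    public static String message(String reason, byte[] value, int maxBytes)
    {
        int shown = Math.min(value.length, maxBytes);
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < shown; i++)
            hex.append(String.format("%02x", value[i])); // %x on a byte formats unsigned
        if (value.length > maxBytes)
            hex.append("... (").append(value.length).append(" bytes total)");
        return reason + "; value: " + hex;
    }

    public static void main(String[] args)
    {
        System.out.println(message("not a valid UTF-8 string",
                                   new byte[]{ (byte) 0xff, 0x01, 0x02 }, 2));
        // not a valid UTF-8 string; value: ff01... (3 bytes total)
    }
}
```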
[jira] [Commented] (CASSANDRA-3007) NullPointerException in MessagingService.java:420
[ https://issues.apache.org/jira/browse/CASSANDRA-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083481#comment-13083481 ] Hudson commented on CASSANDRA-3007: --- Integrated in Cassandra-0.8 #272 (See [https://builds.apache.org/job/Cassandra-0.8/272/]) fix NPE when encryption_options is unspecified patch by jbellis; reviewed by brandonwilliams for CASSANDRA-3007 jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1156749 Files : * /cassandra/branches/cassandra-0.8/CHANGES.txt * /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/net/MessagingService.java NullPointerException in MessagingService.java:420 - Key: CASSANDRA-3007 URL: https://issues.apache.org/jira/browse/CASSANDRA-3007 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 0.8.3 Environment: Linux w0 2.6.35-24-virtual #42-Ubuntu SMP Thu Dec 2 05:15:26 UTC 2010 x86_64 GNU/Linux java version 1.6.0_18 OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1) OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode) Reporter: Viliam Holub Assignee: Jonathan Ellis Priority: Minor Labels: nullpointerexception, streaming Fix For: 0.8.5 Attachments: 3007.txt I'm getting large quantity of exceptions during streaming. It is always in MessagingService.java:420. The streaming appears to be blocked. 
INFO 10:11:14,734 Streaming to /10.235.77.27 ERROR 10:11:14,734 Fatal exception in thread Thread[StreamStage:2,5,main] java.lang.NullPointerException at org.apache.cassandra.net.MessagingService.stream(MessagingService.java:420) at org.apache.cassandra.streaming.StreamOutSession.begin(StreamOutSession.java:176) at org.apache.cassandra.streaming.StreamOut.transferRangesForRequest(StreamOut.java:148) at org.apache.cassandra.streaming.StreamRequestVerbHandler.doVerb(StreamRequestVerbHandler.java:54) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2325) invalidateKeyCache / invalidateRowCache should remove saved cache files from disk
[ https://issues.apache.org/jira/browse/CASSANDRA-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083479#comment-13083479 ] Hudson commented on CASSANDRA-2325: --- Integrated in Cassandra-0.8 #272 (See [https://builds.apache.org/job/Cassandra-0.8/272/]) cache invalidate removes saved cache files patch by Ed Capriolo; reviewed by jbellis for CASSANDRA-2325 jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1156772 Files : * /cassandra/branches/cassandra-0.8/CHANGES.txt * /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java invalidateKeyCache / invalidateRowCache should remove saved cache files from disk - Key: CASSANDRA-2325 URL: https://issues.apache.org/jira/browse/CASSANDRA-2325 Project: Cassandra Issue Type: Improvement Affects Versions: 0.6 Reporter: Matthew F. Dennis Assignee: Edward Capriolo Priority: Minor Fix For: 0.8.5 Attachments: cassandra-2325-1.patch.txt, cassandra-2325-3.patch.txt, cassandra-2325.patch.2.txt the invalidate[Key|Row]Cache calls don't remove the saved caches from disk. It seems logical that if you are clearing the caches you don't expect them to be reinstantiated with the old values the next time C* starts. This is not a huge issue since next time the caches are saved the old values will be removed. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
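The behavior CASSANDRA-2325 asks for — invalidation also removing the on-disk cache file so stale entries are not reloaded at the next startup — amounts to something like the following sketch. The file-naming scheme here is a placeholder; the actual layout of Cassandra's saved-caches directory may differ:

```java
import java.io.File;

public class SavedCacheCleaner {
    // Hypothetical naming scheme for saved cache files (KeyCache / RowCache).
    static File savedCacheFile(File savedCachesDir, String keyspace, String cf, String cacheType) {
        return new File(savedCachesDir, keyspace + "-" + cf + "-" + cacheType);
    }

    // Invalidating the in-memory cache should also delete the on-disk copy,
    // otherwise the old values reappear when the node restarts.
    static boolean deleteSavedCache(File savedCachesDir, String keyspace, String cf, String cacheType) {
        File f = savedCacheFile(savedCachesDir, keyspace, cf, cacheType);
        return !f.exists() || f.delete();   // true when nothing stale remains on disk
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(deleteSavedCache(dir, "Keyspace1", "Standard1", "KeyCache"));
    }
}
```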
[jira] [Commented] (CASSANDRA-2950) Data from truncated CF reappears after server restart
[ https://issues.apache.org/jira/browse/CASSANDRA-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083482#comment-13083482 ] Hudson commented on CASSANDRA-2950: --- Integrated in Cassandra-0.8 #272 (See [https://builds.apache.org/job/Cassandra-0.8/272/]) make sure truncate clears out the commitlog patch by jbellis; reviewed by slebresne for CASSANDRA-2950 jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1156763 Files : * /cassandra/branches/cassandra-0.8/CHANGES.txt * /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/SystemTable.java * /cassandra/branches/cassandra-0.8/test/unit/org/apache/cassandra/db/RecoveryManagerTruncateTest.java * /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/commitlog/CommitLog.java * /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/ColumnFamilyStore.java Data from truncated CF reappears after server restart - Key: CASSANDRA-2950 URL: https://issues.apache.org/jira/browse/CASSANDRA-2950 Project: Cassandra Issue Type: Bug Affects Versions: 0.8.0 Reporter: Cathy Daw Assignee: Jonathan Ellis Fix For: 0.8.5 Attachments: 2950-v2.txt, 2950-v3_0.8.patch, 2950.txt * Configure 3 node cluster * Ensure the java stress tool creates Keyspace1 with RF=3 {code} // Run Stress Tool to generate 10 keys, 1 column stress --operation=INSERT -t 2 --num-keys=50 --columns=20 --consistency-level=QUORUM --average-size-values --replication-factor=3 --create-index=KEYS --nodes=cathy1,cathy2 // Verify 50 keys in CLI use Keyspace1; list Standard1; // TRUNCATE CF in CLI use Keyspace1; truncate counter1; list counter1; // Run stress tool and verify creation of 1 key with 10 columns stress --operation=INSERT -t 2 --num-keys=1 --columns=10 --consistency-level=QUORUM --average-size-values --replication-factor=3 --create-index=KEYS --nodes=cathy1,cathy2 // Verify 1 key in CLI use Keyspace1; list Standard1; // Restart all three nodes // You will see 51 keys 
in CLI use Keyspace1; list Standard1; {code} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083486#comment-13083486 ] Jonathan Ellis commented on CASSANDRA-2034: --- I think v11 is missing the new Callback classes. Can you rebase to trunk when you add those? Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would make disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
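The CASSANDRA-2034 proposal — acknowledge the client as soon as ConsistencyLevel is met, then after RpcTimeout hint any replica that never acked — can be sketched as below. Names like `HintOnTimeout` and `targetsNeedingHints` are illustrative only, not Cassandra's API:

```java
import java.util.Set;
import java.util.concurrent.*;

public class HintOnTimeout {
    // Given the intended replicas and the set that acked within rpc_timeout,
    // return the replicas that still need a local hint written for them.
    static Set<String> targetsNeedingHints(Set<String> replicas, Set<String> acked) {
        Set<String> missing = ConcurrentHashMap.newKeySet();
        missing.addAll(replicas);
        missing.removeAll(acked);
        return missing;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        Set<String> replicas = Set.of("node1", "node2", "node3");
        Set<String> acked = ConcurrentHashMap.newKeySet();
        acked.add("node1");
        acked.add("node2");            // QUORUM reached; the client has already been acked
        // Scheduled task: after the (shortened, illustrative) timeout,
        // compute who never responded and would get a hint.
        ScheduledFuture<Set<String>> check =
            timer.schedule(() -> targetsNeedingHints(replicas, acked), 10, TimeUnit.MILLISECONDS);
        System.out.println("hint targets: " + check.get());
        timer.shutdown();
    }
}
```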
[jira] [Created] (CASSANDRA-3019) log the keyspace and CF of large rows being compacted
log the keyspace and CF of large rows being compacted - Key: CASSANDRA-3019 URL: https://issues.apache.org/jira/browse/CASSANDRA-3019 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King Priority: Minor If you want to find the large rows it'd help to know the Keyspace and CF to look in. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3019) log the keyspace and CF of large rows being compacted
[ https://issues.apache.org/jira/browse/CASSANDRA-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-3019: - Attachment: 0001-add-keyspace-and-cf-to-large-row-compaction-logging.patch log the keyspace and CF of large rows being compacted - Key: CASSANDRA-3019 URL: https://issues.apache.org/jira/browse/CASSANDRA-3019 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King Priority: Minor Attachments: 0001-add-keyspace-and-cf-to-large-row-compaction-logging.patch If you want to find the large rows it'd help to know the Keyspace and CF to look in. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
svn commit: r1156837 - in /cassandra/trunk: CHANGES.txt conf/cassandra.yaml src/java/org/apache/cassandra/config/Config.java
Author: jbellis Date: Thu Aug 11 21:21:11 2011 New Revision: 1156837 URL: http://svn.apache.org/viewvc?rev=1156837view=rev Log: change auto_bootstrap default to true and remove from example config file patch by jbellis as suggested by Peter Schuller for CASSANDRA-2447 Modified: cassandra/trunk/CHANGES.txt cassandra/trunk/conf/cassandra.yaml cassandra/trunk/src/java/org/apache/cassandra/config/Config.java Modified: cassandra/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/trunk/CHANGES.txt?rev=1156837r1=1156836r2=1156837view=diff == --- cassandra/trunk/CHANGES.txt (original) +++ cassandra/trunk/CHANGES.txt Thu Aug 11 21:21:11 2011 @@ -32,6 +32,8 @@ * refactoring of the secondary index api (CASSANDRA-2982) * make CL ONE reads wait for digest reconciliation before returning (CASSANDRA-2494) + * change auto_bootstrap default to true and remove from example config file + (but not config parser) (CASSANDRA-2447) 0.8.5 Modified: cassandra/trunk/conf/cassandra.yaml URL: http://svn.apache.org/viewvc/cassandra/trunk/conf/cassandra.yaml?rev=1156837r1=1156836r2=1156837view=diff == --- cassandra/trunk/conf/cassandra.yaml (original) +++ cassandra/trunk/conf/cassandra.yaml Thu Aug 11 21:21:11 2011 @@ -21,14 +21,6 @@ cluster_name: 'Test Cluster' # a random token, which will lead to hot spots. initial_token: -# Set to true to make new [non-seed] nodes automatically migrate data -# to themselves from the pre-existing nodes in the cluster. Defaults -# to false because you can only bootstrap N machines at a time from -# an existing cluster of N, so if you are bringing up a cluster of -# 10 machines with 3 seeds you would have to do it in stages. Leaving -# this off for the initial start simplifies that. 
-auto_bootstrap: false - # See http://wiki.apache.org/cassandra/HintedHandoff hinted_handoff_enabled: true # this defines the maximum amount of time a dead host will have hints Modified: cassandra/trunk/src/java/org/apache/cassandra/config/Config.java URL: http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/config/Config.java?rev=1156837r1=1156836r2=1156837view=diff == --- cassandra/trunk/src/java/org/apache/cassandra/config/Config.java (original) +++ cassandra/trunk/src/java/org/apache/cassandra/config/Config.java Thu Aug 11 21:21:11 2011 @@ -32,7 +32,7 @@ public class Config /* Hashing strategy Random or OPHF */ public String partitioner; -public Boolean auto_bootstrap = false; +public Boolean auto_bootstrap = true; public Boolean hinted_handoff_enabled = true; public Integer max_hint_window_in_ms = Integer.MAX_VALUE;
[jira] [Resolved] (CASSANDRA-2447) Remove auto-bootstrap option
[ https://issues.apache.org/jira/browse/CASSANDRA-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis resolved CASSANDRA-2447. --- Resolution: Fixed Reviewer: scode Assignee: Jonathan Ellis bq. maybe just make the default true always and remove it form the default configuration Good idea. Done in r1156837 Remove auto-bootstrap option Key: CASSANDRA-2447 URL: https://issues.apache.org/jira/browse/CASSANDRA-2447 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.0 We already optimize auto-bootstrap to be no-op if there are no non-system tables. Given that, the only penalty imposed by autobootstrap is a 30s sleep waiting for gossip. Feels worth it to avoid the confusion this option causes, and the problems if you don't turn it on when it should be. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-2500) Ruby dbi client (for CQL) that conforms to AR:ConnectionAdapter
[ https://issues.apache.org/jira/browse/CASSANDRA-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-2500: -- Assignee: Kelley Reynolds (was: Pavel Yaskevich) Ruby dbi client (for CQL) that conforms to AR:ConnectionAdapter --- Key: CASSANDRA-2500 URL: https://issues.apache.org/jira/browse/CASSANDRA-2500 Project: Cassandra Issue Type: Task Components: API Reporter: Jon Hermes Assignee: Kelley Reynolds Labels: cql Fix For: 0.8.5 Attachments: 2500.txt, genthriftrb.txt, rbcql-0.0.0.tgz Create a ruby driver for CQL. Lacking something standard (such as py-dbapi), going with something common instead -- RoR ActiveRecord Connection Adapter (http://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/AbstractAdapter.html). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
buildbot failure in ASF Buildbot on cassandra-trunk
The Buildbot has detected a new failure on builder cassandra-trunk while building ASF Buildbot. Full details are available at: http://ci.apache.org/builders/cassandra-trunk/builds/1513 Buildbot URL: http://ci.apache.org/ Buildslave for this Build: isis_ubuntu Build Reason: scheduler Build Source Stamp: [branch cassandra/trunk] 1156837 Blamelist: jbellis BUILD FAILED: failed compile sincerely, -The Buildbot
[jira] [Updated] (CASSANDRA-2034) Make Read Repair unnecessary when Hinted Handoff is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patricio Echague updated CASSANDRA-2034: Attachment: CASSANDRA-2034-trunk-v11.patch rebase and add missing classes. Make Read Repair unnecessary when Hinted Handoff is enabled --- Key: CASSANDRA-2034 URL: https://issues.apache.org/jira/browse/CASSANDRA-2034 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Patricio Echague Fix For: 1.0 Attachments: 2034-formatting.txt, CASSANDRA-2034-trunk-v10.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v11.patch, CASSANDRA-2034-trunk-v2.patch, CASSANDRA-2034-trunk-v3.patch, CASSANDRA-2034-trunk-v4.patch, CASSANDRA-2034-trunk-v5.patch, CASSANDRA-2034-trunk-v6.patch, CASSANDRA-2034-trunk-v7.patch, CASSANDRA-2034-trunk-v8.patch, CASSANDRA-2034-trunk-v9.patch, CASSANDRA-2034-trunk.patch Original Estimate: 8h Remaining Estimate: 8h Currently, HH is purely an optimization -- if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does. Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets. This would make disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3019) log the keyspace and CF of large rows being compacted
[ https://issues.apache.org/jira/browse/CASSANDRA-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083749#comment-13083749 ] Hudson commented on CASSANDRA-3019: --- Integrated in Cassandra-0.8 #274 (See [https://builds.apache.org/job/Cassandra-0.8/274/]) log ks and cf of large rows being compacted patch by Ryan King; reviewed by jbellis for CASSANDRA-3019 jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1156830 Files : * /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/compaction/CompactionController.java log the keyspace and CF of large rows being compacted - Key: CASSANDRA-3019 URL: https://issues.apache.org/jira/browse/CASSANDRA-3019 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King Priority: Minor Fix For: 0.8.5 Attachments: 0001-add-keyspace-and-cf-to-large-row-compaction-logging.patch If you want to find the large rows it'd help to know the Keyspace and CF to look in. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2806) Expose gossip/FD info to JMX
[ https://issues.apache.org/jira/browse/CASSANDRA-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083758#comment-13083758 ] Patricio Echague commented on CASSANDRA-2806: - What does everyone prefer ? 1) {code} { UP : Node1, Node2, Node3 DOWN : Node4 } {code} 2) {code} { Node1 : UP Node2 : UP Node3 : DOWN } {code} Expose gossip/FD info to JMX Key: CASSANDRA-2806 URL: https://issues.apache.org/jira/browse/CASSANDRA-2806 Project: Cassandra Issue Type: Improvement Reporter: Brandon Williams Assignee: Patricio Echague Priority: Minor Fix For: 0.8.5 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3005) OutboundTcpConnection's sending queue goes unboundedly without any backpressure logic
[ https://issues.apache.org/jira/browse/CASSANDRA-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Melvin Wang updated CASSANDRA-3005: --- Attachment: c3005.patch The goal here is to remove as many expired messages as we can, to keep the queue from growing without bound. There are two places to do this. 1) When we try to send a message, we drop it if it has already expired. This alone doesn't work if the connection is slow, because the queue may already be piled up. Thus we also have 2) when we enqueue a message, we try to remove some expired messages as well. This is tricky because we now have two threads trying to remove from one queue. Although each individual operation on the queue is protected, without synchronization enqueue() may try to remove a message that is no longer the one it peeked a moment ago (since run() may already have removed it from the queue). To solve this without a lock, I use two queues. Whenever run() finishes one queue, it (and only it) swaps the queue pointers and processes the other one. ConcurrentLinkedQueue is implemented with a non-blocking algorithm, so we don't suffer much here. OutboundTcpConnection's sending queue goes unboundedly without any backpressure logic - Key: CASSANDRA-3005 URL: https://issues.apache.org/jira/browse/CASSANDRA-3005 Project: Cassandra Issue Type: Improvement Reporter: Melvin Wang Assignee: Melvin Wang Attachments: c3005.patch OutboundTcpConnection's sending queue unconditionally queues up the request and process them in sequence. Thinking about tagging the message coming in with timestamp and drop them before actually sending it if the message stays in the queue for too long, which is defined by the message's own time out value. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
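A simplified illustration of the queue-swap idea from the comment above: producers append to one `ConcurrentLinkedQueue`, the single sender thread drains the other and swaps when it runs dry, and anything that sat in the queue past its timeout is dropped instead of sent. This is a sketch of the general technique, not the contents of c3005.patch:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class ExpiringOutboundQueue {
    static final class Entry {
        final String message;
        final long expiresAt;
        Entry(String message, long expiresAt) { this.message = message; this.expiresAt = expiresAt; }
        boolean expired(long now) { return now >= expiresAt; }
    }

    // Two lock-free queues: producers add to 'active'; only the sender
    // thread drains 'draining' and performs the swap, so the two sides
    // never contend over removal from the same queue.
    private volatile ConcurrentLinkedQueue<Entry> active = new ConcurrentLinkedQueue<>();
    private ConcurrentLinkedQueue<Entry> draining = new ConcurrentLinkedQueue<>();

    // Producer side: tag each message with its own expiry deadline.
    void enqueue(String message, long timeoutMs, long now) {
        active.add(new Entry(message, now + timeoutMs));
    }

    // Consumer side (the run() loop): swap once the current queue is
    // empty, then send unexpired entries and silently drop expired ones.
    int drainAndSend(long now) {
        if (draining.isEmpty()) {
            ConcurrentLinkedQueue<Entry> tmp = draining;
            draining = active;
            active = tmp;
        }
        int sent = 0;
        Entry e;
        while ((e = draining.poll()) != null)
            if (!e.expired(now))
                sent++;               // a real implementation would write to the socket here
        return sent;
    }

    public static void main(String[] args) {
        ExpiringOutboundQueue q = new ExpiringOutboundQueue();
        q.enqueue("fresh", 100, 0);   // expires at t=100
        q.enqueue("stale", 1, 0);     // expires at t=1
        System.out.println(q.drainAndSend(50));  // only "fresh" survives
    }
}
```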
[jira] [Updated] (CASSANDRA-2806) Expose gossip/FD info to JMX
[ https://issues.apache.org/jira/browse/CASSANDRA-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patricio Echague updated CASSANDRA-2806: Attachment: screenshot-1.jpg Expose gossip/FD info to JMX Key: CASSANDRA-2806 URL: https://issues.apache.org/jira/browse/CASSANDRA-2806 Project: Cassandra Issue Type: Improvement Reporter: Brandon Williams Assignee: Patricio Echague Priority: Minor Fix For: 0.8.5 Attachments: screenshot-1.jpg -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3005) OutboundTcpConnection's sending queue goes unboundedly without any backpressure logic
[ https://issues.apache.org/jira/browse/CASSANDRA-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Melvin Wang updated CASSANDRA-3005: --- Attachment: (was: c3005.patch) OutboundTcpConnection's sending queue goes unboundedly without any backpressure logic - Key: CASSANDRA-3005 URL: https://issues.apache.org/jira/browse/CASSANDRA-3005 Project: Cassandra Issue Type: Improvement Reporter: Melvin Wang Assignee: Melvin Wang OutboundTcpConnection's sending queue unconditionally queues up the request and process them in sequence. Thinking about tagging the message coming in with timestamp and drop them before actually sending it if the message stays in the queue for too long, which is defined by the message's own time out value. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3005) OutboundTcpConnection's sending queue goes unboundedly without any backpressure logic
[ https://issues.apache.org/jira/browse/CASSANDRA-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Melvin Wang updated CASSANDRA-3005: --- Attachment: c3005.patch The previous patch was not up-to-date. sorry for the confusion. OutboundTcpConnection's sending queue goes unboundedly without any backpressure logic - Key: CASSANDRA-3005 URL: https://issues.apache.org/jira/browse/CASSANDRA-3005 Project: Cassandra Issue Type: Improvement Reporter: Melvin Wang Assignee: Melvin Wang Attachments: c3005.patch OutboundTcpConnection's sending queue unconditionally queues up the request and process them in sequence. Thinking about tagging the message coming in with timestamp and drop them before actually sending it if the message stays in the queue for too long, which is defined by the message's own time out value. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2806) Expose gossip/FD info to JMX
[ https://issues.apache.org/jira/browse/CASSANDRA-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083850#comment-13083850 ] Patricio Echague commented on CASSANDRA-2806: - For Gossiper, although some info such as ApplicationState is exposed through FD, I propose: - getVersion(String nodeAddress) - getEndpointDowntime(String nodeAddress) - getCurrentGenerationNumber(String nodeAddress) Comments? Expose gossip/FD info to JMX Key: CASSANDRA-2806 URL: https://issues.apache.org/jira/browse/CASSANDRA-2806 Project: Cassandra Issue Type: Improvement Reporter: Brandon Williams Assignee: Patricio Echague Priority: Minor Fix For: 0.8.5 Attachments: screenshot-1.jpg -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-2806) Expose gossip/FD info to JMX
[ https://issues.apache.org/jira/browse/CASSANDRA-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patricio Echague updated CASSANDRA-2806: Attachment: CASSANDRA-2806-0.8-v1.patch This patch addresses the info for FD only. Expose gossip/FD info to JMX Key: CASSANDRA-2806 URL: https://issues.apache.org/jira/browse/CASSANDRA-2806 Project: Cassandra Issue Type: Improvement Reporter: Brandon Williams Assignee: Patricio Echague Priority: Minor Fix For: 0.8.5 Attachments: CASSANDRA-2806-0.8-v1.patch, screenshot-1.jpg -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-1473) Implement a Cassandra aware Hadoop mapreduce.Partitioner
[ https://issues.apache.org/jira/browse/CASSANDRA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083852#comment-13083852 ] Jonathan Ellis commented on CASSANDRA-1473: --- Actually even for RP I don't see how to make this useful. I was thinking md5(key) % partitions but that's not actually going to group the keys by node at all. It's _a_ partitioning but not a _useful_ one. :) Implement a Cassandra aware Hadoop mapreduce.Partitioner Key: CASSANDRA-1473 URL: https://issues.apache.org/jira/browse/CASSANDRA-1473 Project: Cassandra Issue Type: Improvement Components: Hadoop Reporter: Stu Hood Assignee: Patricio Echague Fix For: 1.0 When using a IPartitioner that does not sort data in byte order (RandomPartitioner for example) with Cassandra's Hadoop integration, Hadoop is unaware of the output order of the data. We can make Hadoop aware of the proper order of the output data by implementing Hadoop's mapreduce.Partitioner interface: then Hadoop will handle sorting all of the data according to Cassandra's IPartitioner, and the writing clients will be able to connect to smaller numbers of Cassandra nodes. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3017) add a Message size limit
[ https://issues.apache.org/jira/browse/CASSANDRA-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083853#comment-13083853 ] Jonathan Ellis commented on CASSANDRA-3017: --- How could we do that to a remote sender? add a Message size limit Key: CASSANDRA-3017 URL: https://issues.apache.org/jira/browse/CASSANDRA-3017 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Ryan King Priority: Minor Labels: lhf Attachments: 0001-use-the-thrift-max-message-size-for-inter-node-messa.patch We protect the server from allocating huge buffers for malformed message with the Thrift frame size (CASSANDRA-475). But we don't have similar protection for the inter-node Message objects. Adding this would be good to deal with malicious adversaries as well as a malfunctioning cluster participant. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3017) add a Message size limit
[ https://issues.apache.org/jira/browse/CASSANDRA-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13083881#comment-13083881 ] Ryan King commented on CASSANDRA-3017: -- I think fatal errors are what we're trying to avoid here. The biggest threat is probably malicious, not accidental (since you need to get the MAGIC and headers in before this length). add a Message size limit Key: CASSANDRA-3017 URL: https://issues.apache.org/jira/browse/CASSANDRA-3017 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Ryan King Priority: Minor Labels: lhf Attachments: 0001-use-the-thrift-max-message-size-for-inter-node-messa.patch We protect the server from allocating huge buffers for malformed message with the Thrift frame size (CASSANDRA-475). But we don't have similar protection for the inter-node Message objects. Adding this would be good to deal with malicious adversaries as well as a malfunctioning cluster participant. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
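The guard the CASSANDRA-3017 comments describe amounts to validating the declared body length before allocating a buffer for it. The sketch below is illustrative (the cap and method names are assumptions, not Cassandra's internode protocol code); the point is that a peer who can produce the MAGIC and headers can otherwise force a near-2GB allocation with only a few bytes on the wire:

```java
import java.io.*;

public class MessageSizeGuard {
    // Hypothetical cap, analogous to the Thrift frame-size limit from CASSANDRA-475.
    static final int MAX_MESSAGE_SIZE = 16 * 1024 * 1024;

    static boolean acceptable(int declaredSize) {
        return declaredSize >= 0 && declaredSize <= MAX_MESSAGE_SIZE;
    }

    // Check the declared length before allocating the body buffer,
    // failing the connection instead of attempting a huge allocation.
    static byte[] readMessage(DataInputStream in) throws IOException {
        int size = in.readInt();
        if (!acceptable(size))
            throw new IOException("invalid message body size: " + size);
        byte[] body = new byte[size];
        in.readFully(body);
        return body;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(3);
        out.write(new byte[]{1, 2, 3});
        byte[] body = readMessage(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(body.length);
    }
}
```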
[jira] [Created] (CASSANDRA-3020) Failures in system test: test_cql.py
Failures in system test: test_cql.py Key: CASSANDRA-3020 URL: https://issues.apache.org/jira/browse/CASSANDRA-3020 Project: Cassandra Issue Type: Bug Affects Versions: 0.8.5 Environment: 0.8 branch for Aug 11th test run: https://jenkins.qa.datastax.com/job/CassandraSystem/142/ https://jenkins.qa.datastax.com/job/Cassandra/147/ Reporter: Cathy Daw Assignee: Tyler Hobbs *Test Output* {code} == ERROR: reading and writing strings w/ newlines -- Traceback (most recent call last): File /usr/local/lib/python2.6/dist-packages/nose/case.py, line 187, in runTest self.test(*self.arg) File /var/lib/jenkins/jobs/Cassandra/workspace/test/system/test_cql.py, line 734, in test_newline_strings , {key: \nkey, name: \nname}) File /var/lib/jenkins/repos/drivers/py/cql/cursor.py, line 150, in execute self.description = self.decoder.decode_description(self._query_ks, self._query_cf, self.result[0]) File /var/lib/jenkins/repos/drivers/py/cql/decoders.py, line 39, in decode_description comparator = self.__comparator_for(keyspace, column_family) File /var/lib/jenkins/repos/drivers/py/cql/decoders.py, line 35, in __comparator_for return cfam.get(comparator, None) AttributeError: 'NoneType' object has no attribute 'get' -- Ran 127 tests in 635.426s FAILED (errors=1) Sending e-mails to: q...@datastax.com Finished: FAILURE {code} *Suspected check-in* {code} Revision 1156198 by xedin: Fixes issues with parameters being escaped incorrectly in Python CQL patch by Tyler Hobbs; reviewed by Pavel Yaskevich for CASSANDRA-2993 /cassandra/branches/cassandra-0.8/test/system/test_cql.py /cassandra/branches/cassandra-0.8/CHANGES.txt /cassandra/drivers/py/cql/cursor.py /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/cql/Cql.g /cassandra/drivers/py/test/test_regex.py {code} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira