Re: Lucandra Ingestion
Jonathan Ellis-3 wrote:
> Are you using multiple threads?

I'm adding threading now, and at times I get exceptions about a broken pipe. I then added the following:

    synchronized (this) {
        indexWriter.addDocument(doc, analyzer);
    }

That did get rid of the problem. I'm currently using Phasers (jsr166) to register one thread per file found in a given directory, but ingestion still seems slow.

Has anyone else ingested a large number of files and found ways to optimize ingestion? If I apply a patch for batch operations (from the link in the post), will it work with the version of Cassandra supported by Lucandra?

Thanks again.

--
View this message in context: http://n2.nabble.com/Lucandra-Ingestion-tp4415691p4457044.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
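A note on the threading question above: a broken pipe under concurrency is often a sign that several threads are writing to one connection at once, though nothing in this thread confirms that as the cause here. If it is, an alternative to one global lock is to give each worker thread its own writer. The following is only a minimal Java sketch under that assumption. DocumentSink, ParallelIngest, and the factory argument are hypothetical names, not Lucandra or Cassandra APIs; the sink stands in for whatever wraps indexWriter.addDocument(doc, analyzer).

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for whatever writes one document to the index. */
interface DocumentSink {
    void add(String document) throws Exception;
}

final class ParallelIngest {
    /**
     * Indexes all documents on a fixed-size pool. Each worker thread lazily
     * creates its own sink, so no sink is ever shared between threads.
     * Returns the number of documents indexed without an exception.
     */
    static int ingest(List<String> documents, int threads,
                      Supplier<DocumentSink> sinkFactory) {
        ThreadLocal<DocumentSink> perThreadSink = ThreadLocal.withInitial(sinkFactory);
        AtomicInteger indexed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : documents) {
            pool.execute(() -> {
                try {
                    perThreadSink.get().add(doc);
                    indexed.incrementAndGet();
                } catch (Exception e) {
                    // Report and continue; one bad document should not stop the run.
                    System.err.println("Failed to index document: " + e);
                }
            });
        }
        pool.shutdown();
        try {
            // The one-hour limit is an arbitrary choice for this sketch.
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexed.get();
    }
}
```

A caller would pass a factory that opens a fresh connection and writer for each thread. Since each term is still a separate insert, as Jake points out elsewhere in this thread, this spreads the cost across threads rather than reducing it.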
Re: Lucandra Ingestion
Sure. Thanks, Jake and Jon!
Re: Lucandra Ingestion
Thanks Jake. No, I'm not currently using multiple threads; I will do that next. Thanks for the link. Hopefully Lucandra will support this as well.

Jake Luciani wrote:
> How big are the documents? Each term requires an insert, so it's def slow on Lucandra's side. Once the bulk insert for many keys is available, this should go much faster.
>
> https://issues.apache.org/jira/browse/CASSANDRA-336
>
> Looks like it will be in the 0.6 release.
>
> -Jake
>
> On Mon, Jan 18, 2010 at 2:29 PM, ML_Seda sonnyh...@gmail.com wrote:
>> I'm inserting a lot of documents into Cassandra/Lucandra. The problem is that ingestion is fairly slow: the addDocument(Document doc, Analyzer analyzer) method takes 25-50 milliseconds.
>>
>> Was any work done to speed this up? Maybe a bulk insert?
>>
>> Thanks
Re: Lucandra Ingestion
This particular document is 8,158 bytes, and I am storing up to six fields, only one of which is indexed and stored. Indexing is taking:

Indexing Took: 14714ms

This is problematic when I'm trying to ingest millions of documents.

Jake Luciani wrote:
> How big are the documents? Each term requires an insert, so it's def slow on Lucandra's side. Once the bulk insert for many keys is available, this should go much faster.
>
> https://issues.apache.org/jira/browse/CASSANDRA-336
>
> Looks like it will be in the 0.6 release.
>
> -Jake
Re: Data Model Index Text
I'm assuming I have to run the Thrift gen-java from the Cassandra 0.4 release. Is there any documentation or tutorial on how to get that up and running? I've checked both Cassandra and Lucandra into Eclipse, but the Lucandra project is still unable to resolve some classes. Is this because I need to generate the Java client classes?

Thanks.

Jake Luciani wrote:
> It should work, but not a ton has changed in 2.9/3.0 AFAIK. I'm going to work on updating Lucandra to work with the 0.5 branch; I can try to update this as well.
>
> BTW, if you want to see Lucandra in action, check out http://flocking.me (example: http://flocking.me/tjake ).
>
> You can use a random partitioner if you store the entire index under a supercolumn (how it was originally implemented), but then you need to accept that the entire index will be in memory for any operation on that index (bad for big indexes).
>
> -Jake
>
> On Wed, Jan 13, 2010 at 9:14 AM, Ryan Daum r...@thimbleware.com wrote:
>> On the topic of Lucandra, apart from having it work with 0.5 of Cassandra, has any work been done to get it up to date with Lucene 2.9/3.0?
>>
>> Also, I'm a bit concerned about its use of OrderPreservingPartitioner; is there an architecture for storage that could be considered that would work with RandomPartitioner?
>>
>> Ryan
Re: Data Model Index Text
I do see the classes now, but only all the way back in version .20. Is there a newer version of Lucandra? It would be nice for us to use the latest Cassandra (trunk).
Re: Data Model Index Text
Thanks Drew.

That is correct. That would be one of the queries (give me all documents in which a list of terms is present). But not only that: another query would allow users to search for words around a given word. Keying in Michael would return a list of the words that follow Michael in all documents (e.g. Jordan, Jackson, etc.). The same is done for words before a given word.

Is Cassandra not optimal for this? As pointed out by Ian, I will look into Lucandra as well.

Thanks.
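To make the "words around a given word" query above concrete, here is a rough in-memory sketch in Java. It models one possible layout, not a recommendation from anyone in this thread: the row key is a term, each column is a neighbouring term, and the value is the set of documents where that pair occurs. NeighborIndex and its methods are hypothetical names, the tokenizer is a naive split on non-word characters, and nothing here talks to Cassandra or Lucandra.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory model: row key = term, column = neighbouring term, value = doc ids. */
final class NeighborIndex {
    private final Map<String, Map<String, Set<String>>> following = new TreeMap<>();
    private final Map<String, Map<String, Set<String>>> preceding = new TreeMap<>();

    /** Records every adjacent word pair of the text in both directions. */
    void addDocument(String docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].isEmpty()) {
                continue; // a leading separator yields an empty first token
            }
            record(following, tokens[i], tokens[i + 1], docId);
            record(preceding, tokens[i + 1], tokens[i], docId);
        }
    }

    /** Words seen directly after the term, each mapped to the docs containing the pair. */
    Map<String, Set<String>> wordsAfter(String term) {
        return following.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    /** Words seen directly before the term, each mapped to the docs containing the pair. */
    Map<String, Set<String>> wordsBefore(String term) {
        return preceding.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    private static void record(Map<String, Map<String, Set<String>>> index,
                               String rowTerm, String neighbor, String docId) {
        index.computeIfAbsent(rowTerm, k -> new TreeMap<>())
             .computeIfAbsent(neighbor, k -> new TreeSet<>())
             .add(docId);
    }
}
```

With documents such as "Michael Jordan played basketball" and "Michael Jackson sang", wordsAfter("Michael") would list jackson and jordan along with the documents each pair came from. The cost is that every adjacent pair is stored twice, once per direction.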
Re: Data Model Index Text
Is there a particular version of Cassandra required for Lucandra to work? It's not able to resolve the Cassandra class, along with a few others. I have trunk Cassandra checked out, and Lucandra from the GitHub link provided below.

Ian Holsman-3 wrote:
> Hi ML.
>
> This sounds more like a job for SOLR, but if you want to do this with Cassandra, you should look at Jake's Lucandra: http://github.com/tjake/Lucandra
>
> You should also look at http://nicklothian.com/blog/2009/10/27/solr-cassandra-solandra/
>
> I wouldn't recommend building your own IR engine; just use one of the ones out there.
>
> regards
> Ian
Re: Data Model Index Text
Thanks Jake. I don't see org.apache.cassandra.service.Cassandra in 0.4; it is imported in BookmarksDemo.java.

Jake Luciani wrote:
> currently uses the 0.4 release series.
Data Model Index Text
Hey,

I've been reading up on the Cassandra data model a bit, and would like to get some input from this forum on different techniques for a particular problem.

Assume I need to index millions of text docs (e.g. research papers) and allow querying them by a given word inside, or around, any of the indexed docs. That is, if I search for terms, I would like to get a list of docs in which these terms show up (e.g. for "Michael Jordan", Michael is the main term and Jordan is the next term, n1; the same can be applied by indicating previous terms to Michael).

How do I model this in Cassandra? Would my keys be a concat of the middle term + docid? Will I be able to do queries by wildcarding the docid?

Thanks.
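On the "term + docid" key idea in the question above: one way to picture it is a sorted key space in which "wildcarding the docid" becomes a range scan over every key that starts with the term. The Java sketch below only imitates that with a TreeMap; it does not use any Cassandra API. TermDocKeys, the NUL separator, and the method names are assumptions made for illustration. Whether a real cluster can serve such a scan depends on keys being stored in order, which is presumably why OrderPreservingPartitioner comes up elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Sorted map standing in for an ordered key space; keys are term + separator + docId. */
final class TermDocKeys {
    // Assumption: the NUL character never occurs inside a term or a doc id.
    private static final char SEP = '\0';
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void put(String term, String docId, String value) {
        rows.put(term + SEP + docId, value);
    }

    /** Returns every doc id stored under exactly this term, in key order. */
    List<String> docIdsFor(String term) {
        String from = term + SEP;
        // The smallest string that sorts after every key beginning with term + SEP.
        String to = term + (char) (SEP + 1);
        List<String> docIds = new ArrayList<>();
        for (String key : rows.subMap(from, true, to, false).keySet()) {
            docIds.add(key.substring(from.length()));
        }
        return docIds;
    }
}
```

The separator matters: without it, a scan for "michael" would also pick up keys for longer terms such as "michaela". A partial term such as "mich" matches nothing here, because the scan is bounded to keys of the form term, separator, docId.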