Re: Lucandra Ingestion

2010-01-25 Thread ML_Seda


Jonathan Ellis-3 wrote:
 
 Are you using multiple threads?
 

I'm adding threading now, and at times I get exceptions about a broken pipe.

I then added the following:

    synchronized (this) {
        // let only one thread at a time call addDocument
        indexWriter.addDocument(doc, analyzer);
    }

That did get rid of the problem.  I'm currently using Phasers (jsr166) to
register a thread for each file found in a given directory.  It still seems
slow, though.

Has anyone else ingested a large number of files and found ways to optimize
ingestion?  If I apply a patch for batch operations (from the link in the
post), will it work with the version of Cassandra that Lucandra supports?
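
For reference, here is a minimal sketch of one alternative to putting every
write behind a single lock: a fixed thread pool in which each worker thread
owns its own writer.  DocumentSink and SinkFactory are hypothetical stand-ins
invented for this example, not Lucandra's API, and whether a Lucandra
IndexWriter (and its Cassandra connection) can be created once per thread is
an assumption that would need checking.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIngest {
    /** Hypothetical stand-in for whatever writes one document to the index. */
    public interface DocumentSink {
        void add(String doc) throws Exception;
    }

    /** Hypothetical factory: builds one sink (with its own connection) per worker. */
    public interface SinkFactory {
        DocumentSink create();
    }

    /** Adds every document using `threads` workers; returns how many succeeded. */
    public static int ingest(List<String> docs, int threads, SinkFactory factory)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // Each worker thread lazily creates its own sink, so add() needs no lock.
        ThreadLocal<DocumentSink> sink = ThreadLocal.withInitial(factory::create);
        AtomicInteger succeeded = new AtomicInteger();
        for (String doc : docs) {
            pool.submit(() -> {
                try {
                    sink.get().add(doc);
                    succeeded.incrementAndGet();
                } catch (Exception e) {
                    System.err.println("failed to add document: " + e);
                }
            });
        }
        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for queued tasks to finish
        return succeeded.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = List.of("a", "b", "c", "d");
        // No-op sink for the demo; a real sink would write to the index.
        int added = ingest(docs, 2, () -> doc -> { });
        System.out.println(added + " documents added");
    }
}
```

If the writer cannot be instantiated per thread, the synchronized block
remains the safe choice, and the threads then mostly help with reading and
analyzing files rather than with the writes themselves.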

Thanks again.

-- 
View this message in context: 
http://n2.nabble.com/Lucandra-Ingestion-tp4415691p4457044.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Lucandra Ingestion

2010-01-25 Thread ML_Seda

Sure.  Thanks Jake & Jon!
-- 
View this message in context: 
http://n2.nabble.com/Lucandra-Ingestion-tp4415691p4457366.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Lucandra Ingestion

2010-01-18 Thread ML_Seda

Thanks Jake.  No, I'm not currently using multiple threads.  I will do that
next.

Thanks for the link.  Hopefully Lucandra will support this as well.


Jake Luciani wrote:
 
 How big are the documents?  Each term requires an insert, so it's
 definitely slow on Lucandra's side.  Once the bulk insert for many keys is
 available, this should go much faster.
 
 https://issues.apache.org/jira/browse/CASSANDRA-336
 
 Looks like it will be in 0.6 release.
 
 -Jake
 
 On Mon, Jan 18, 2010 at 2:29 PM, ML_Seda sonnyh...@gmail.com wrote:
 

 I'm inserting a lot of documents into Cassandra/Lucandra.  The problem is
 that ingestion is fairly slow: the method

     addDocument(Document doc, Analyzer analyzer)

 takes 25-50 milliseconds.

 Was there any work done to speed this up?  Maybe a bulk insert?

 Thanks

 
 

-- 
View this message in context: 
http://n2.nabble.com/Lucandra-Ingestion-tp4415691p4415835.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Lucandra Ingestion

2010-01-18 Thread ML_Seda

This particular document is 8,158 bytes, and I am storing up to six fields,
only one of which is indexed and stored.

Indexing is taking:
Indexing Took: 14714ms

This is problematic when I'm trying to ingest millions of documents.
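
To put rough numbers on that, here is a small back-of-the-envelope
calculation.  It assumes one million documents added strictly one after
another; the corpus size is a placeholder for "millions", and the per-document
times are the figures quoted in this thread.

```java
import java.util.Locale;

public class IngestEstimate {
    /** Hours needed to add `docs` documents sequentially at `msPerDoc` each. */
    public static double hoursFor(long docs, double msPerDoc) {
        return docs * msPerDoc / 1000.0 / 3600.0;
    }

    public static void main(String[] args) {
        long docs = 1_000_000L; // assumed corpus size, not a figure from the thread
        for (double ms : new double[] {25, 50, 14714}) {
            System.out.println(String.format(Locale.ROOT,
                    "%.0f ms/doc -> %.1f hours", ms, hoursFor(docs, ms)));
        }
    }
}
```

At 25-50 ms per document that is roughly 7 to 14 hours per million documents;
at 14714 ms per document it is about 4,087 hours, or around 170 days.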



-- 
View this message in context: 
http://n2.nabble.com/Lucandra-Ingestion-tp4415691p4415874.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Data Model Index Text

2010-01-13 Thread ML_Seda

I'm assuming I have to run the thrift gen-java from the Cassandra 0.4 release.
Is there any documentation or a tutorial on how to get that up and running?

I've checked out both Cassandra and Lucandra into Eclipse, but the Lucandra
project is still unable to resolve some classes.  Is this because I need to
generate the Java client classes?

Thanks.

Jake Luciani wrote:
 
 It should work, but not a ton has changed in 2.9/3.0 AFAIK.  I'm going to
 work on updating Lucandra to work with the 0.5 branch, and I can try to
 update this as well.  BTW, if you want to see Lucandra in action, check out
 http://flocking.me (example: http://flocking.me/tjake )

 You can use a random partitioner if you store the entire index under a
 supercolumn (how it was originally implemented), but then you need to accept
 that the entire index will be in memory for any operation on that index (bad
 for big indexes).
 
 -Jake
 
 On Wed, Jan 13, 2010 at 9:14 AM, Ryan Daum r...@thimbleware.com wrote:
 
 On the topic of Lucandra, apart from having it work with 0.5 of Cassandra,
 has any work been done to get it up to date with Lucene 2.9/3.0?

 Also, I'm a bit concerned about its use of OrderPreservingPartitioner; is
 there an architecture for storage that could be considered that would work
 with RandomPartitioner?

 Ryan


 On Tue, Jan 12, 2010 at 12:20 PM, ML_Seda sonnyh...@gmail.com wrote:


 I do see the classes now, but only all the way back in version .20.  Is
 there a newer version of Lucandra?  It would be nice for us to use the
 latest Cassandra (trunk).



 
 

-- 
View this message in context: 
http://n2.nabble.com/Data-Model-Index-Text-tp4275199p4349520.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Data Model Index Text

2010-01-12 Thread ML_Seda

I do see the classes now, but only all the way back in version .20.  Is there
a newer version of Lucandra?  It would be nice for us to use the latest
Cassandra (trunk).
-- 
View this message in context: 
http://n2.nabble.com/Data-Model-Index-Text-tp4275199p4293071.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Data Model Index Text

2010-01-11 Thread ML_Seda

Thanks Drew.  That is correct.

That would be one of the queries (give me all documents in which a given
list of terms is present).  Beyond that, another query would let users search
for the words around a given word.

Keying in "Michael" would return a list of the words that come after the
word "Michael" in any document (e.g. Jordan, Jackson, etc.).  The same is
done for the words that come before a given word.
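
To make the query concrete, here is a minimal in-memory sketch of the lookup
I have in mind.  It only illustrates the shape of the data (term, then
neighboring term, then doc ids); the class and method names are made up for
this example, and it uses neither Cassandra nor Lucandra.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class NeighborIndex {
    // term -> following term -> ids of the docs where that pair occurs
    private final Map<String, Map<String, Set<String>>> next = new TreeMap<>();
    // term -> preceding term -> ids of the docs where that pair occurs
    private final Map<String, Map<String, Set<String>>> prev = new TreeMap<>();

    /** Records every adjacent pair of words in `text` under `docId`. */
    public void addDocument(String docId, String text) {
        String[] terms = text.toLowerCase().split("\\W+");
        for (int i = 0; i + 1 < terms.length; i++) {
            if (terms[i].isEmpty()) continue; // text began with punctuation
            put(next, terms[i], terms[i + 1], docId);
            put(prev, terms[i + 1], terms[i], docId);
        }
    }

    private static void put(Map<String, Map<String, Set<String>>> index,
                            String term, String neighbor, String docId) {
        index.computeIfAbsent(term, t -> new TreeMap<>())
             .computeIfAbsent(neighbor, n -> new TreeSet<>())
             .add(docId);
    }

    /** Words seen directly after `term`, each mapped to the docs containing the pair. */
    public Map<String, Set<String>> wordsAfter(String term) {
        return next.getOrDefault(term.toLowerCase(), Collections.emptyMap());
    }

    /** Words seen directly before `term`, each mapped to the docs containing the pair. */
    public Map<String, Set<String>> wordsBefore(String term) {
        return prev.getOrDefault(term.toLowerCase(), Collections.emptyMap());
    }

    public static void main(String[] args) {
        NeighborIndex idx = new NeighborIndex();
        idx.addDocument("doc1", "Michael Jordan played basketball");
        idx.addDocument("doc2", "Michael Jackson sang");
        System.out.println(idx.wordsAfter("Michael")); // prints {jackson=[doc2], jordan=[doc1]}
    }
}
```

One possible mapping onto a column store would be one row per term with one
column per neighboring term, but that layout is a guess on my part.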

Is Cassandra not optimal for this?  As Ian pointed out, I will look into
Lucandra as well.  Thanks.
-- 
View this message in context: 
http://n2.nabble.com/Data-Model-Index-Text-tp4275199p4286704.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Data Model Index Text

2010-01-11 Thread ML_Seda

Is there a particular version of Cassandra required for Lucandra to work?

It's not able to resolve the Cassandra class, along with a few others.  I have
trunk Cassandra checked out, and Lucandra from the GitHub link provided
below.


Ian Holsman-3 wrote:
 
 Hi ML.
 This sounds more like a job for Solr, but if you want to do this with
 Cassandra, you should look at Jake's Lucandra: http://github.com/tjake/Lucandra

 You should also look at
 http://nicklothian.com/blog/2009/10/27/solr-cassandra-solandra/

 I wouldn't recommend building your own IR engine; just use one of the
 ones out there.

 Regards,
 Ian
 
 --
 Ian Holsman
 i...@holsman.net
 
 
 
 
 

-- 
View this message in context: 
http://n2.nabble.com/Data-Model-Index-Text-tp4275199p4288808.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Data Model Index Text

2010-01-11 Thread ML_Seda

Thanks Jake.

I don't see the class org.apache.cassandra.service.Cassandra in 0.4, even
though BookmarksDemo.java imports it.



Jake Luciani wrote:
 
 Lucandra currently uses the 0.4 release series.
 

-- 
View this message in context: 
http://n2.nabble.com/Data-Model-Index-Text-tp4275199p4289009.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at 
Nabble.com.


Data Model Index Text

2010-01-08 Thread ML_Seda

Hey,

I've been reading up on the Cassandra data model a bit, and would like to
get some input from this forum on different techniques for a particular
problem.

Assume I need to index millions of text docs (e.g. research papers) and let
users query them by a given word, or by the words around it, in any of the
indexed docs.  That is, if I search for terms, I would like to get a list of
the docs in which those terms show up (e.g. for "Michael Jordan", Michael is
the main term and Jordan is the next term, n1; the same can be applied to the
terms preceding Michael).

How do I model this in Cassandra?

Would my keys be a concatenation of the middle term + docid?  Will I be able
to do queries by wildcarding the docid?
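
To illustrate what I mean by "wildcarding the docid", here is a minimal
sketch that uses a sorted in-memory map as a stand-in for rows kept in key
order.  The ":" separator and all names are assumptions made for the example;
this is not Cassandra's API, and the approach only makes sense if keys are
actually stored in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class CompositeKeyScan {
    private static final char SEP = ':'; // assumed separator; must not occur in terms

    // Sorted map standing in for rows kept in key order; values are unused here.
    private final TreeMap<String, Boolean> rows = new TreeMap<>();

    /** Stores one row whose key is the term concatenated with the doc id. */
    public void put(String term, String docId) {
        rows.put(term + SEP + docId, Boolean.TRUE);
    }

    /** Doc ids stored under `term`, found by scanning the keys that start with "term:". */
    public List<String> docsFor(String term) {
        String from = term + SEP;
        String to = term + (char) (SEP + 1); // first key past every "term:*" key
        List<String> docs = new ArrayList<>();
        for (String key : rows.subMap(from, true, to, false).keySet()) {
            docs.add(key.substring(from.length()));
        }
        return docs;
    }

    public static void main(String[] args) {
        CompositeKeyScan idx = new CompositeKeyScan();
        idx.put("michael", "doc1");
        idx.put("michael", "doc7");
        idx.put("michaela", "doc3");
        System.out.println(idx.docsFor("michael")); // prints [doc1, doc7]
    }
}
```

The "wildcard" is really a range scan from "michael:" up to the next possible
prefix, which is why "michaela" is not picked up.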

Thanks.
-- 
View this message in context: 
http://n2.nabble.com/Data-Model-Index-Text-tp4275199p4275199.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at 
Nabble.com.