Re: Proposal to move/hide the general@lucene list
+1 for getting rid of it. Doesn't seem to serve any purpose. Otis -- Solr ElasticSearch Support http://sematext.com/ On Thu, Jan 24, 2013 at 11:19 AM, Smiley, David W. dsmi...@mitre.org wrote: The general@lucene.apache.org list is often misused for Lucene or Solr help that belongs on their respective lists. I'm okay with the list being discontinued. If people are not okay with that, then I propose modifying the page where people currently discover the list so that they aren't likely to use it instead of the proper list. http://lucene.apache.org/core/discussion.html#general-discussion-generallucene Perhaps simply adding a NOT for users seeking help with Lucene message in red. I can see how users, in a hurry, can look at the existing description (without having read the java-user list description prior) and think that the general list is the right place. ~ David
Re: Choosing the right project
Hi, For a neutral Solr/ES comparison look at http://blog.sematext.com/ Nutch can index into Solr, but I think not into ES yet. ManifoldCF has a crawler and can index into both Solr and ES, but its crawler is not made for large-scale crawling as is the case with Nutch. Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm Search Analytics - http://sematext.com/search-analytics/index.html From: timd timdavi...@msn.com To: general@lucene.apache.org Sent: Friday, November 23, 2012 3:13 AM Subject: Choosing the right project Hello all, I am planning to build a general purpose search engine, where rankings for certain types of online resources (e.g. cell phones) will be boosted based on an external rating. I am therefore looking for the right search index software (Solr or ElasticSearch) that can provide such a boosting option and the right crawler (Nutch or Heritrix) that can extract ratings for certain products from other sites. What solution package would you recommend? I very much appreciate your help on this. -- View this message in context: http://lucene.472066.n3.nabble.com/Choosing-the-right-project-tp4021980.html Sent from the Lucene - General mailing list archive at Nabble.com.
Re: [DISCUSS] Adding ASF comment system to the Lucene websites
Hi, Looks handy. What about spam control? Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm - Original Message - From: Steven A Rowe sar...@syr.edu To: general@lucene.apache.org general@lucene.apache.org; d...@lucene.apache.org d...@lucene.apache.org Cc: Sent: Monday, July 9, 2012 2:38 PM Subject: [DISCUSS] Adding ASF comment system to the Lucene websites I'd like to add the new ASF comment system to the Lucene websites: https://blogs.apache.org/infra/entry/asf_comments_system_live Thoughts? Steve
Re: Licensing questions
Hi Tiffany, Apache Lucene is free. There is no corporation behind it. It is released under the Apache Software License by the Apache Software Foundation. Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm From: Tiffany Karl tk...@logixhealth.com To: gene...@lucene.com Cc: general@lucene.apache.org Sent: Thursday, May 10, 2012 11:42 AM Subject: Licensing questions Hi, Our company LogixHealth is looking to implement different search controls. We will be using the search for different applications starting with a portal; there will be approximately 500 users across multiple applications. We are inquiring about how much it would cost to license Lucene search and which one you would recommend. Much appreciated, Tiffany Tiffany Karl Manager, Product Management LogixHealth ▪ 8 Oak Park Drive, Bedford, MA 01730 Phone: 781.280.1566 tk...@logixhealth.com www.logixhealth.com
Re: [Announce] Solr 3.5 with RankingAlgorithm 1.3, NRT support
Hi, Is there a writeup that describes how this compares to NRT support in the development version of Solr? Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html From: Nagendra Nagarajayya nnagaraja...@transaxtions.com To: general@lucene.apache.org Sent: Tuesday, December 27, 2011 9:32 AM Subject: [Announce] Solr 3.5 with RankingAlgorithm 1.3, NRT support Hi! I am very excited to announce the availability of Solr 3.5 with RankingAlgorithm 1.3 (NRT support). The performance to add 1 million docs in NRT to the MBArtists index with 1 concurrent request thread executing *:* is about 5000 docs in 498 ms. The query performance is about 168K query requests at 4.2 ms / request. RankingAlgorithm 1.3 supports the entire Lucene Query Syntax, +/- and/or boolean queries. RankingAlgorithm is very fast and allows you to query a 10m wikipedia index (complete index) in 50 ms. You can get more information about NRT performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver3.x You can download Solr 3.5 with RankingAlgorithm 1.3 from here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org
Re: Suggestions or best practices for indexing the logs
Alex, You could try compressing the content field - that might help a bit. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Alex Shneyderman a.shneyder...@gmail.com To: general@lucene.apache.org Sent: Thursday, October 13, 2011 7:21 PM Subject: Suggestions or best practices for indexing the logs Hello, everybody! I am trying to introduce faster searches to our application that sifts through the logs. And Lucene seems to be the tool to use here. The one peculiarity of the problem: it seems there are few files and they contain many log statements. I avoid storing the text in the index itself. Given all this I set up indexing as follows: I iterate over a log file, and for each statement in the log file I index the statement's content. Here is the Java code that does the field additions:

NumericField startOffset = new NumericField("so", Field.Store.YES, false);
startOffset.setLongValue(statement.getStartOffset());
doc.add(startOffset);
NumericField endOffset = new NumericField("eo", Field.Store.YES, false);
endOffset.setLongValue(statement.getEndOffset());
doc.add(endOffset);
NumericField timestampField = new NumericField("ts", Field.Store.YES, true);
timestampField.setLongValue(statement.getStatementTime().getTime());
doc.add(timestampField);
doc.add(new Field("fn", fileTagName, Field.Store.YES, Field.Index.NO));
doc.add(new Field("ct", statement.getContent(), Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO));

I am getting the following results (index size vs log files) with this scheme. The size of the logs is 385MB (du -ms /var/tmp/logs: 385). The size of the index is 143MB (du -ms /var/tmp/index: 143). Is this a normal ratio, 143MB / 385MB? It seems like a bit too much (I would expect something like 1/5 - 1/7 for the index). Is there anything I can do to move this to the desired ratio?
Of course what would help is the words histogram, and here is the top of the output of the words histogram script that I ran on the logs:

Total number of words: 26935271
Number of different words: 551981
The most common words are:
as 3395203
10 797708
13 797662
2011 795595
at 787365
timer 746790
...

Could anyone suggest a better way to organize the index for my logs? And by better I mean more compact. Or is this as good as it gets? I tried to optimize and got a 2MB improvement (the index went from 145MB to 143MB). Could anyone point to an article that deals with indexing of logs? Any help, suggestions and pointers are greatly appreciated. Thanks for any and all help and cheers, Alex.
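One note on Otis's compression suggestion: compression applies to stored field values (Lucene of this era exposes it via org.apache.lucene.document.CompressionTools), and the ct field above is indexed but not stored, so it would matter mainly if Alex started storing content. How well repetitive log text compresses can be estimated standalone, without Lucene, using java.util.zip; the log line in this sketch is invented for illustration:

```java
import java.util.zip.Deflater;

public class LogCompressionSketch {

    // Deflate the input and return the compressed size in bytes.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical repetitive log content, echoing the timestamp- and
        // "timer"-heavy histogram from the thread; the line itself is made up.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("2011-10-13 19:21:").append(i % 60).append(" INFO timer fired as expected\n");
        }
        byte[] raw = sb.toString().getBytes();
        System.out.println(raw.length + " bytes raw, " + compressedSize(raw) + " bytes deflated");
    }
}
```

Log text with this much repetition typically deflates to a small fraction of its raw size, which is why compressing stored log content is usually worthwhile.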
Re: Multiple Solr replication threads
Ram, What is x in your case and how much data needs to be replicated each time, roughly? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: bramsreddy bramsre...@gmail.com To: general@lucene.apache.org Sent: Monday, September 5, 2011 1:50 AM Subject: Multiple Solr replication threads Hi, I have one master-slave setup. The slave pulls the index from the master every x seconds. The problem is that for one single replication slot, two threads are getting created and trying to work in parallel on the same index. This is causing a lock obtain timed out exception:

2011-09-05 07:40:00,014 INFO [org.apache.solr.handler.SnapPuller] (pool-15-thread-1) Master's version: 1310981586400, generation: 10188
2011-09-05 07:40:00,014 INFO [org.apache.solr.handler.SnapPuller] (pool-15-thread-1) Slave's version: 1310981586382, generation: 10170
2011-09-05 07:40:00,014 INFO [org.apache.solr.handler.SnapPuller] (pool-15-thread-1) Starting replication process
2011-09-05 07:40:00,016 INFO [org.apache.solr.handler.SnapPuller] (pool-19-thread-1) Master's version: 1310981586400, generation: 10188
2011-09-05 07:40:00,016 INFO [org.apache.solr.handler.SnapPuller] (pool-19-thread-1) Slave's version: 1310981586393, generation: 10181
2011-09-05 07:40:00,017 INFO [org.apache.solr.handler.SnapPuller] (pool-19-thread-1) Starting replication process

How can I make it create a single thread per replication? Regards Ram
Re: CLOSE_WAIT after connecting to multiple shards from a primary shard
Hi, A few things: 1) why not send this to the Solr list? 2) you talk about searching, but the code sample is about optimizing the index. 3) I don't have the SolrJ API in front of me, but isn't there a CommonsHttpSolrServer ctor that takes in a URL instead of an HttpClient instance? Try that one. Otis - Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Mukunda Madhava mukunda...@gmail.com To: general@lucene.apache.org Sent: Mon, May 30, 2011 1:54:07 PM Subject: CLOSE_WAIT after connecting to multiple shards from a primary shard Hi, We are having a primary Solr shard, and multiple secondary shards. We query data from the secondary shards by specifying the shards param in the query params. But we found that after receiving the data, there are a large number of CLOSE_WAIT sockets on the secondary shards from the primary shards. For example:

tcp 1 0 primaryshardhost:56109 secondaryshardhost1:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:51049 secondaryshardhost1:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:49537 secondaryshardhost1:8089 CLOSE_WAIT
tcp 1 0 primaryshardhost:44109 secondaryshardhost2:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:32041 secondaryshardhost2:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:48533 secondaryshardhost2:8089 CLOSE_WAIT

We open the Solr connections as below:

SimpleHttpConnectionManager cm = new SimpleHttpConnectionManager(true);
cm.closeIdleConnections(0L);
HttpClient httpClient = new HttpClient(cm);
solrServer = new CommonsHttpSolrServer(url, httpClient);
solrServer.optimize();

But still we see these issues. Any ideas? -- Thanks, Mukunda
Re: is query cache persisted?
Hi, Are you using raw Lucene or Solr? If Solr, your query is probably cached in the query results cache (see your solrconfig.xml). Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Yang tedd...@gmail.com To: general@lucene.apache.org Sent: Tue, April 12, 2011 1:35:19 PM Subject: is query cache persisted? I was trying to trace through the calls in Lucene, and when I invoked the same query for the second time, scorer.score() is no longer called, and the query returns very fast. This seems to be the case even after I restarted Tomcat, so I'm wondering: is the query cache persisted in Lucene? If so, how could I purge it? Thanks a lot Yang
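If this is Solr, the cache Otis mentions is declared in solrconfig.xml. It lives on the JVM heap and is discarded on commit or restart, so speed right after a Tomcat restart usually comes from the OS file-system cache rather than any persisted query cache. A typical entry (the sizes here are illustrative, not recommendations) looks like:

```xml
<!-- Caches the ordered doc ids of previously executed queries. -->
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>
```

Commenting the entry out, or setting size="0", is the usual way to take the cache out of the picture when measuring raw query cost.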
Re: Number of Boolean Clauses (AND vs OR)
I believe AND will be faster, at least in cases when one of the earlier clauses doesn't actually match any docs, in which case the whole query should terminate early and not evaluate the remaining clauses. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: entdeveloper cameron.develo...@gmail.com To: general@lucene.apache.org Sent: Mon, April 11, 2011 2:50:28 PM Subject: Number of Boolean Clauses (AND vs OR) Does the type of boolean clause matter when there is a large number of boolean clauses? In other words, if I have a query with 20 boolean clauses, will 20 OR clauses perform any faster or slower than 20 AND clauses? I know these would perform completely different queries, but I am asking on a theoretical level. Obviously the total number matters, hence the limit of 1024 max boolean clauses.
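The early-termination intuition can be sketched without Lucene's actual scorers: a conjunction (AND) over sorted posting lists is driven by its smallest clause and stops as soon as any clause is exhausted, while a disjunction (OR) must consume every list in full. A toy illustration with invented doc ids (not Lucene code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class BooleanClauseSketch {

    // Conjunction (AND): iterate the first list and check the others.
    // If the first (rarest) list is empty, no other list is ever touched.
    static List<Integer> intersect(int[][] postings) {
        List<Integer> hits = new ArrayList<>();
        if (postings.length == 0) return hits;
        outer:
        for (int doc : postings[0]) {
            for (int i = 1; i < postings.length; i++) {
                if (Arrays.binarySearch(postings[i], doc) < 0) continue outer;
            }
            hits.add(doc);
        }
        return hits;
    }

    // Disjunction (OR): every posting list must be consumed in full.
    static List<Integer> union(int[][] postings) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (int[] list : postings) {
            for (int doc : list) hits.add(doc);
        }
        return new ArrayList<>(hits);
    }
}
```

Lucene's real conjunction scorers use skip lists rather than binary search, but the asymmetry is the same: the AND's cost is bounded by its smallest clause, while the OR's cost is the sum of all clauses.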
Re: Get last search data from SOLR
Jotta, You may want to ask on the solr-user list in the future. If you are asking whether Solr can tell you what the last document was that Solr returned for the last query it executed, the answer is no. Maybe you can describe what you are trying to accomplish, so we can help you. Email solr-user though. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: jotta sobcz...@gmail.com To: general@lucene.apache.org Sent: Mon, January 17, 2011 1:51:13 AM Subject: Get last search data from SOLR Hi! I have a question about SOLR searching. Does SOLR support getting the resources last searched by users (from some period of time)? PS Sorry for my English :) Regards Jotta
Re: Apache Solr is not available
Hi, I think you'll get more help if you ask the Drupal community. That error message is specific to Drupal. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: nitishgarg nitishgarg1...@gmail.com To: general@lucene.apache.org Sent: Sat, December 25, 2010 2:39:18 AM Subject: Apache Solr is not available I am using Drupal and Apache Solr. I keep getting the error Apache Solr is not available. Please contact your administrator. It works fine when newly installed, but this error crops up after some time. Any suggestions?
Re: [PMC] Next Steps on Lucene.NET
Personally, I would be *very* interested in whether moving Lucene.NET to GitHub will make a difference in terms of progress and style of development. Maybe forking, pull requests, and the whole social thing makes it easier for people to participate. Since Lucene.NET has struggled for years at ASF, this would be a great opportunity to see if the above makes a difference. My 0.02 NT Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Grant Ingersoll gsing...@apache.org To: general@lucene.apache.org Sent: Thu, December 23, 2010 10:41:18 AM Subject: Re: [PMC] Next Steps on Lucene.NET On Dec 21, 2010, at 12:00 PM, Chris Hostetter wrote: : point, it's either the Attic or Incubator and I'm leaning toward Attic. : However, I think it makes sense to give it one more chance by saying: You : have until January 31 to put together a proposal for going back to the : Incubator. Please see http://incubator.apache.org for what such a : proposal entails. I'm not certain the attic is appropriate -- my understanding is that it's the final resting place for projects (TLP, ie: an entire PMC) that are being dissolved at a foundation level via board resolution. Within a PMC, like Lucene, the decision to retire specific sub-projects and mailing lists probably doesn't need to require a board resolution. but i could be wrong. OK, I'm not sure either. I will check. We could certainly just mothball it here, but I don't think that is necessarily what we want either. In either case, having a hard date seems like a good idea -- i thought one had been established before, but i guess not. The hard date of addressing the 4 issues was set for the end of the year. I don't think any of them have been addressed. There was a big discussion for a while, but it doesn't seem like anyone has done any of the actual work, even something as simple as updating the website.
This next date, in my mind, is to make it clear that the Lucene PMC is done being responsible for Lucene.NET by Jan. 31. I am more than willing to help them move somewhere else, but it is up to them to say where that is. -Grant
Re: TF IDF values for a search term
Vikas, look at DefaultSimilarity. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: vikas kumar vikasn...@gmail.com To: general@lucene.apache.org Sent: Thu, December 16, 2010 6:53:24 AM Subject: TF IDF values for a search term Hi All, I am working on the latest Lucene API. I want the TF and IDF values for a search term explicitly. Can anybody suggest which class/method can provide me that? I went through the Similarity, Explanation and Score classes but didn't find any method which can return the TF or IDF as integers. Regards Vikas
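For reference, DefaultSimilarity's tf and idf factors are tiny formulas, and they return floats rather than the integers the question asks for. Restated standalone (this mirrors the formulas documented in the Similarity javadoc; it is not the Lucene source itself):

```java
public class DefaultSimilaritySketch {

    // Term frequency factor: tf(freq) = sqrt(freq).
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency factor:
    // idf(docFreq, numDocs) = log(numDocs / (docFreq + 1)) + 1.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log((double) numDocs / (docFreq + 1)) + 1.0);
    }
}
```

If the goal is to see these values per term at query time, Searcher.explain() already breaks a document's score into its tf and idf components, which may be simpler than subclassing Similarity.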
Re: Should I avoid MultiFieldQueryParser?
What you lose by aggregating all real fields into 1 field is the ability to give fields different scoring weights. Is a match in the post title equally important as a match in the body or in one of the comments? If yes, then aggregate. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Bob Eastbrook baconeater...@gmail.com To: general@lucene.apache.org Sent: Mon, May 17, 2010 12:49:32 AM Subject: Should I avoid MultiFieldQueryParser? Imagine a blog that needs to be searched. I first thought I'd index posts and comments using these fields: BlogPostTitle BlogPostContent BlogComment There could be any number of BlogComments. I have this working fine and use MultiFieldQueryParser to generate a query. It seems to work. A search for picnic matches that term in post titles, post contents, and comments. However, Lucene in Action (2nd edition MEAP proof, chapter 5 section 4) seems to advocate against using MultiFieldQueryParser and instead suggests using a single synthetic field to hold all searchable text. Perhaps this field would be called contents or keywords. Is this accepted to be a best practice? Should I dump a BlogPostTitle, BlogPostContent, and its BlogComments into a single field? Bob
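Concretely, the weighting Otis describes means expanding one user query across fields with different boosts, either by hand or via the MultiFieldQueryParser constructor variant that takes a per-field boost map. An illustrative expansion in Lucene query syntax (the boost values here are made up):

```
BlogPostTitle:picnic^3.0 BlogPostContent:picnic^1.5 BlogComment:picnic
```

With a single synthetic contents field that distinction is gone; a common middle ground is to keep the synthetic field for recall and add a separately boosted title field on top of it.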
Re: java.io.IOException: read past EOF
Jean-Michel, java-u...@lucene is a better place to ask. I'd do this: * back up your index * use the CheckIndex tool (if it existed in your version of Lucene?) Maybe the Luke version you are using has a mismatching Lucene version? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Message From: Jean-Michel RAMSEYER jm.ramse...@greenivory.com To: general@lucene.apache.org Sent: Tue, March 23, 2010 5:36:42 PM Subject: java.io.IOException: read past EOF Hi there, I'm new to Lucene's world and I'm currently meeting a problem on an index. I'm running Lucene 2.4.1 on a Linux server with a Sun JVM version 1.6.0.17b04, in which the issue http://issues.apache.org/jira/browse/LUCENE-1282 is solved. I tried to open the indexes on another computer with Luke but it fails too. The segments* files are empty, so is there a way to rebuild the index from the cfs files? Is there a way to recover this index? Thank you for your answers.
Exception trace:

java.io.IOException: read past EOF
 at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
 at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
 at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:36)
 at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:68)
 at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:221)
 at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:95)
 at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653)
 at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:115)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:206)
 at org.apache.lucene.search.IndexSearcher.init(IndexSearcher.java:47)

ls -lah result:

total 18G
drwxr-xr-x 2 tomcat tomcat 4.0K 2010-03-22 16:29 .
drwxr-xr-x 121 tomcat tomcat 12K 2010-03-23 14:22 ..
-rw-r--r-- 1 tomcat tomcat 1.9G 2010-03-20 13:57 _1gg2.cfs
-rw-r--r-- 1 tomcat tomcat 2.0G 2010-03-20 21:45 _1yhj.cfs
-rw-r--r-- 1 tomcat tomcat 1.9G 2010-03-21 04:16 _2gdz.cfs
-rw-r--r-- 1 tomcat tomcat 2.0G 2010-03-21 15:00 _2y9u.cfs
-rw-r--r-- 1 tomcat tomcat 2.0G 2010-03-22 03:21 _3ghg.cfs
-rw-r--r-- 1 tomcat tomcat 2.0G 2010-03-22 07:09 _3xty.cfs
-rw-r--r-- 1 tomcat tomcat 2.0G 2010-03-22 12:24 _4ekl.cfs
-rw-r--r-- 1 tomcat tomcat 192M 2010-03-22 13:25 _4gn2.cfs
-rw-r--r-- 1 tomcat tomcat 198M 2010-03-22 14:23 _4ief.cfs
-rw-r--r-- 1 tomcat tomcat 195M 2010-03-22 15:14 _4kbm.cfs
-rw-r--r-- 1 tomcat tomcat 21M 2010-03-22 15:18 _4kil.cfs
-rw-r--r-- 1 tomcat tomcat 23M 2010-03-22 15:22 _4kop.cfs
-rw-r--r-- 1 tomcat tomcat 22M 2010-03-22 15:27 _4ku0.cfs
-rw-r--r-- 1 tomcat tomcat 25M 2010-03-22 15:31 _4kzb.cfs
-rw-r--r-- 1 tomcat tomcat 21M 2010-03-22 15:36 _4l56.cfs
-rw-r--r-- 1 tomcat tomcat 1.9M 2010-03-22 15:36 _4l5r.cfs
-rw-r--r-- 1 tomcat tomcat 2.0M 2010-03-22 15:37 _4l6c.cfs
-rw-r--r-- 1 tomcat tomcat 165K 2010-03-22 15:37 _4l6d.cfs
-rw-r--r-- 1 tomcat tomcat 58K 2010-03-22 15:37 _4l6e.cfs
-rw-r--r-- 1 tomcat tomcat 80K 2010-03-22 15:37 _4l6f.cfs
-rw-r--r-- 1 tomcat tomcat 149K 2010-03-22 15:37 _4l6g.cfs
-rw-r--r-- 1 tomcat tomcat 218K 2010-03-22 15:37 _4l6h.cfs
-rw-r--r-- 1 tomcat tomcat 198K 2010-03-22 15:37 _4l6i.cfs
-rw-r--r-- 1 tomcat tomcat 45K 2010-03-22 15:37 _4l6j.cfs
-rw-r--r-- 1 tomcat tomcat 58K 2010-03-22 15:37 _4l6k.cfs
-rw-r--r-- 1 tomcat tomcat 158K 2010-03-22 15:37 _4l6l.cfs
-rw-r--r-- 1 tomcat tomcat 116K 2010-03-22 15:37 _4l6m.cfs
-rw-r--r-- 1 tomcat tomcat 1.1M 2010-03-22 15:37 _4l6n.cfs
-rw-r--r-- 1 tomcat tomcat 128K 2010-03-22 15:37 _4l6o.cfs
-rw-r--r-- 1 tomcat tomcat 1.9G 2010-03-20 04:12 _hnt.cfs
-rw-r--r-- 1 tomcat tomcat 0 2010-03-22 15:37 segments_44o3
-rw-r--r-- 1 tomcat tomcat 0 2010-03-22 15:37 segments_44o4
-rw-r--r-- 1 tomcat tomcat 0 2010-03-22 15:37 segments.gen
-rw-r--r-- 1 tomcat tomcat 1.9G 2010-03-20 07:52 _ywu.cfs
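The CheckIndex tool Otis mentions ships inside the Lucene core jar and can be run from the command line. An invocation sketch (the jar name and paths depend on your install; note that -fix drops any unreadable segments, so run it against the backup):

```
java -cp lucene-core-2.4.1.jar org.apache.lucene.index.CheckIndex /path/to/index
java -cp lucene-core-2.4.1.jar org.apache.lucene.index.CheckIndex /path/to/index -fix
```

With zero-byte segments_N files like those in the listing above, though, CheckIndex has nothing to read either; in that situation recovery generally means re-indexing.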
Re: [VOTE] merge lucene/solr development (take 3)
Would it be correct to say that in order to have a vote be perfectly clear, the VOTE thread should have just the votes and no comments/discussion? Otis - Original Message From: Grant Ingersoll gsing...@apache.org To: general@lucene.apache.org Sent: Fri, March 12, 2010 11:02:34 AM Subject: Re: [VOTE] merge lucene/solr development (take 3) On Mar 12, 2010, at 10:56 AM, Mattmann, Chris A (388J) wrote: Hi Simon, On 3/12/10 4:30 AM, Simon Willnauer simon.willna...@googlemail.com wrote: I don't think that is the case. A large number of different concerns are out there. Simply based on the amount of huge comments this seems to be not a clearly passed vote. simon Agreed. Comments are not votes. Tally up the +1, 0, and -1's. There is your vote. If people don't understand that the thing you are voting on is the first email in the [VOTE] thread, then I don't know how else to explain it. This thread very clearly has something to vote on in the first email. -Grant
Re: [VOTE] merge lucene/solr development (take 3)
Hi, Would it be correct to say that a subset of Lucene/Solr committers discussed the proposal internally/offline (i.e. not on MLs) before proposing it? Thanks, Otis
Re: [VOTE] merge lucene/solr development (take 3)
Hello, - Original Message From: Grant Ingersoll gsing...@apache.org To: general@lucene.apache.org Sent: Fri, March 12, 2010 12:03:07 PM Subject: Re: [VOTE] merge lucene/solr development (take 3) On Mar 12, 2010, at 11:54 AM, patrick o'leary wrote: Go look at the votes. Which ones? from vote 1 2 or 3?? 3. That is this thread. But I also recall people (Mark Miller maybe?) saying that the votes are not being counted and we are just looking to get an idea about the sentiment on this suggestion (paraphrasing him, sorry if I messed something up). Otis
Re: [VOTE] merge lucene/solr development (take 3)
Hi, But remember the early days of this (or these) vote threads. I recall some people saying things like I won't vote -1 since I don't want to veto the proposal, so I'll vote +|-0. I recall Doug being one of those people. I don't think we heard back from Doug in subsequent vote threads. I think there were a few others on the fence. I don't think I even voted because things were not clear and there was too much discussion going on. If I had to vote, I think I'd vote -1 mainly because I believe that what I think the proposal's goal is can be achieved with the current structure. I mentioned this in some emails about a week ago, but nobody from the +1 side reacted from what I recall. I agree that in general in life it's impossible to get 100% of people to agree on something, and sometimes that means that a largish minority will have to live with a change they disagree with, but here I feel that there are other ways of achieving the desired goal, so it's not clear to me why those less drastic ways are not tried first. I'll send a separate email about those ways. Otis - Original Message From: Michael McCandless luc...@mikemccandless.com To: general@lucene.apache.org Sent: Sun, March 14, 2010 6:28:57 AM Subject: Re: [VOTE] merge lucene/solr development (take 3) On Sun, Mar 14, 2010 at 12:26 AM, Michael Busch busch...@gmail.com wrote: This whole thing feels like it's been pushed through, and while I'm not against the updated proposal anymore (I voted +0), the bad feeling that consensus wasn't really reached remains. But: this vote is not expected nor required to reach consensus. We as a community are very used to only pursuing things when they reach [near-]consensus, simply because nearly every biggish topic we discuss must first reach consensus. That's a very high bar and it blocks many good changes (look at how many times we've broached relaxing back compat policy...). This change does not require consensus.
It requires only a majority to pass, which it has achieved. Yes, it's contentious, but a change this big will always be contentious, and this is why Apache requires only majority for it to pass. Mike
Re: [VOTE] merge lucene/solr development (take 3)
Hi, - Original Message From: Grant Ingersoll gsing...@apache.org To: general@lucene.apache.org Sent: Tue, March 9, 2010 5:00:42 PM Subject: Re: [VOTE] merge lucene/solr development (take 3) On Mar 9, 2010, at 12:38 PM, Otis Gospodnetic wrote: * I think Grant may be right. We don't need this discussion. Because the Solr/Lucene developer overlap is excellent, why not just start moving selected Solr code to new Lucene modules, just like Mike proposed we move Analysis from Lucene core to a new Lucene module? Note, if you read what I said again you will realize I wasn't actually proposing this. I was actually saying that I think it would not be something that people really wanted, even though it is perfectly legal, just like poaching is perfectly legal, but isn't, in my mind, a good solution. Sigh. The problem with email, I guess, especially on long threads. My feeling was that the majority of people said poaching (in a very positive sense) is the way OSS works. Why can't we start with poaching/refactoring and then, in N months, evaluate both the outcome and the process and see if things can work that way in the future[*] or whether something more drastic should be done? Additionally, if I understand things correctly, poaching is only needed when the code is not committed in the right project/location to begin with. Otis
Re: Less drastic ways
Hello, - Original Message From: Grant Ingersoll gsing...@apache.org To: general@lucene.apache.org Sent: Sun, March 14, 2010 12:40:51 PM Subject: Re: Less drastic ways On Mar 14, 2010, at 12:28 PM, Otis Gospodnetic wrote: Hi, Consider this just an email to clarify things for Otis (and maybe a few other people). Are the following the main goals of the recent merge voting thread(s)? * Make it easier for Solr to ride the Lucene trunk * Make it easier for people to avoid committing new features to Solr when they really belong to some lower level code - either Lucene core or some Lucene module Is the only or main change being proposed that lucene-dev and solr-dev move to some common-dev (or lucene-dev)? If the above is correct, here is what I don't understand: * Why can't Solr riding on Lucene trunk be achieved by getting a Lucene trunk build into Solr's lib in svn on a daily/hourly basis? GI: I just don't see that working. OG: Could you please elaborate? Also, why not try it and see? It requires very few infrastructure changes and no reorg. Reorg can always be done later if this first step proves to be inadequate. * Why can't existing Solr functionality that has been identified as should really have been committed to Lucene instead of Solr be moved to Lucene over the coming months? GI: First up is analysis, I suspect. OG: Si! * Why can't Solr developers be required to be subscribed to lucene-dev? They should. That's the immediate step going forward until the various infra gyrations are undertaken. * Why can't Solr developers be required/urged to commit any new functionality to Lucene if solr-dev and lucene-dev people think that's where it belongs? i.e. communicate before committing - the same as measure twice, cut once. GI: Of course they will. This is how committing works on any and all projects anyway. OG: Hm, again I'm confused.
If this is how it worked in Solr/Lucene land, then there wouldn't be pieces in Solr that we now want to refactor and move into Lucene core or modules. A list of about 4-5 such pieces of functionality in Solr has already been listed. That's really my main question. Why were/can't things be committed to the appropriate place? Why were they committed to Solr? Thanks, Otis
Re: [VOTE] merge lucene/solr development (take 3)
Hi, - Original Message From: Yonik Seeley ysee...@gmail.com To: general@lucene.apache.org Sent: Sun, March 14, 2010 3:48:10 PM Subject: Re: [VOTE] merge lucene/solr development (take 3) On Sun, Mar 14, 2010 at 2:36 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: if I understand things correctly, poaching is only needed when the code is not committed in the right project/location to begin with. That is the problem though - Solr should be allowed to keep whatever code was written under its control, w/o pressure to put it in Lucene But don't we want DRY? And don't we want to take some of the goodness that evolved under Solr and modularize it, so that vanilla-Lucene users can benefit from individual pieces? (and often out of reach). Does this remain true if we get Lucene trunk jar -> Solr trunk lib going on a regular (e.g. nightly) basis? And Lucene should be able to poach what it wants from Solr. But with the projects already half overlapping... it was a recipe for conflict. Poaching - right, it's just that if you build X in project A and then you want to move X to project B, it seems like more work needs to be done than if X was committed to B to begin with. We've already had conflicts about this in the past. The conflicts were either going to get worse over time, esp with Solr not on Lucene's trunk, or we were going to merge. We've decided to tear down the artificial wall and work together. Some people suggest that this could have worked w/o merging. I disagreed, as I think the majority of those voting +1 disagreed. Not sure who's following lucene-dev and solr-dev, but the committers have already been merged. We're not standing still... Hm. So there was talk of Lucene core and the new idea of Lucene modules, which are really just standalone libs/APIs/jars, right? Would it make sense to think of Solr as one such Lucene module?
In other words, don't even bother with merging just the -dev lists, but really just merge everything. In that case Solr's relationship with Lucene core becomes much like the relationship Lucene contribs have with Lucene core today in terms of compatibility, builds, and committers' responsibilities? That kind of makes sense to me. Of course, because of the sheer volume we may want to keep -user lists separate and possibly even create new ones for Lucene modules that attract enough interest on their own. Otis
Re: Less drastic ways
I don't get it, Mike. :) Even if we merge Lucene/Solr and we treat Solr as just another Lucene contrib/module, say, contributors who care only about Solr will still patch against Solr, and Lucene developers or those people who have the itch for that functionality being in Lucene, too, will still have to poach/refactor and pull that functionality into Lucene later on. Whether Solr is a separate project or a Lucene contrib/module that has its own user (and contributor) community that is not tightly integrated with Lucene's -dev community, the same thing will happen, no? Maybe it will help if we made things visual for us visual peeps. Is this, roughly, what the plan is:

trunk/
  lucene-core/
  modules/
    analysis/
      wordnet/
      spellchecker/
      whatever/
      ...
    facets/
    ...
    functions/
  solr/
    dih/
    ...

? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Message From: Michael McCandless luc...@mikemccandless.com To: general@lucene.apache.org Sent: Sun, March 14, 2010 4:34:42 PM Subject: Re: Less drastic ways Hm, again I'm confused. If this is how it worked in Solr/Lucene land, then there wouldn't be pieces in Solr that we now want to refactor and move into Lucene core or modules. A list of about 4-5 such pieces of functionality in Solr has already been listed. That's really my main question. Why were/can't things be committed to the appropriate place? Why were they committed to Solr? Pre-merge: If someone wants new functionality in Solr they should be free to create a patch to make it work well, in Solr, alone. To expect them to also factor it so that it works well for Lucene-only users is wrong. They should not need to, nor be expected to, and they shouldn't feel bad not having factored it that way. They use Solr and they need it working in Solr and that was their itch and they scratched it and net/net that was a great step forward for Solr.
We should not up and reject contributions because they are not well factored for the two projects. Beggars can't be choosers... Someone who later has the itch for this functionality in Lucene should then be fully free to pick it up, refactor, and make it work in Lucene alone, by poaching it (pulling it into Lucene). Poaching is a natural way for code to be pulled across projects... and while in the short term it'd result in code dup, in the long term this is how refactoring can happen across projects. It's completely normal and fine, in my opinion. But poaching, while effective, is slow ... Lucene would poach, have to stabilize, do a release, Solr would have to upgrade and then fix to cut over to Lucene's sources (assuming the sources hadn't diverged too much, else Solr would have to wait for Lucene's next release, etc.) And we have *a lot* of modules to refactor here, between Solr and Lucene. So for these two reasons I vote for merging Solr/Lucene dev over gobs of poaching. That gives us complete freedom to quickly move the code around. Poaching should still be perfectly fine for other cases, like pulling analyzers from Nutch, from other projects, etc. Mike
Re: [VOTE] merge lucene/solr development (take 3)
Hello, (just using Yonik's email to reply, but my comments are more general) - Original Message From: Yonik Seeley ysee...@gmail.com To: general@lucene.apache.org Sent: Tue, March 9, 2010 10:04:20 AM Subject: Re: [VOTE] merge lucene/solr development (take 3) On Tue, Mar 9, 2010 at 9:48 AM, Mattmann, Chris A (388J) wrote: I have built 10s of projects that have simply used Lucene as an API and had no need for Solr, and I've built 10s of projects where Solr made perfect sense. So, I appreciate their separation. As does everyone - which is why there will always be separate downloads. As a user, the only side effect you should see is an improved Lucene and Solr. Saying that Solr should move some stuff to Lucene for Lucene's benefit, without regard to whether it's actually beneficial to Solr, is a non-starter. The lucene/solr committers have been down that road before. The solution that most committers agreed would improve the development of both projects is to merge development. * I'd completely understand the non-starter part if Lucene and Solr had disjoint sets of committers. But that's not the case. * Which is why I (like a few others) don't see why this whole thing cannot be solved by better discussion of what to develop where from the get-go * Whenever people listed features built in Solr that really should have been in Lucene, I wondered: so why weren't they developed in Lucene in the first place? Again, this should be possible because the same person can commit to both projects. * I hear Grant's explanation on wanting something in Solr ASAP and not wanting to commit that something to Lucene (even though it logically belongs there) because Solr is not on Lucene trunk, but isn't this just a matter of getting a Lucene trunk nightly -> Solr trunk lib process going in svn? * Ian is 100% right. This stuff clearly requires more discussion and a proper VOTE should wait a week or so. Otis
Re: [VOTE] merge lucene/solr development (take 3)
* Re poaching (aka cross-project refactoring) - I think this is the way to go. I think this is normal evolution of OSS projects. I think this should be done if the functionality was not committed to the best (lowest common denominator?) project from the beginning, as in all the Solr/Lucene examples brought up * I think Grant may be right. We don't need this discussion. Because the Solr/Lucene developer overlap is excellent, why not just start moving selected Solr code to new Lucene modules, just like Mike proposed we move Analysis from Lucene core to a new Lucene module? * What do people think about doing what I wrote above as step 1 in this whole process? When that is done in N months, we can see if we can improve on it? This would also fit the "progress, not perfection" mantra. Otis - Original Message From: Otis Gospodnetic otis_gospodne...@yahoo.com To: general@lucene.apache.org Sent: Tue, March 9, 2010 12:23:59 PM Subject: Re: [VOTE] merge lucene/solr development (take 3) Hello, (just using Yonik's email to reply, but my comments are more general) - Original Message From: Yonik Seeley To: general@lucene.apache.org Sent: Tue, March 9, 2010 10:04:20 AM Subject: Re: [VOTE] merge lucene/solr development (take 3) On Tue, Mar 9, 2010 at 9:48 AM, Mattmann, Chris A (388J) wrote: I have built 10s of projects that have simply used Lucene as an API and had no need for Solr, and I've built 10s of projects where Solr made perfect sense. So, I appreciate their separation. As does everyone - which is why there will always be separate downloads. As a user, the only side effect you should see is an improved Lucene and Solr. Saying that Solr should move some stuff to Lucene for Lucene's benefit, without regard to whether it's actually beneficial to Solr, is a non-starter. The lucene/solr committers have been down that road before. The solution that most committers agreed would improve the development of both projects is to merge development.
* I'd completely understand the non-starter part if Lucene and Solr had disjoint sets of committers. But that's not the case. * Which is why I (like a few others) don't see why this whole thing cannot be solved by better discussion of what to develop where from the get-go * Whenever people listed features built in Solr that really should have been in Lucene, I wondered: so why weren't they developed in Lucene in the first place? Again, this should be possible because the same person can commit to both projects. * I hear Grant's explanation on wanting something in Solr ASAP and not wanting to commit that something to Lucene (even though it logically belongs there) because Solr is not on Lucene trunk, but isn't this just a matter of getting a Lucene trunk nightly -> Solr trunk lib process going in svn? * Ian is 100% right. This stuff clearly requires more discussion and a proper VOTE should wait a week or so. Otis
Re: [VOTE] merge lucene/solr development
+1 this is software. let's try it. if it doesn't work out, we know what to do. Otis - Original Message From: Yonik Seeley yo...@apache.org To: general@lucene.apache.org Sent: Wed, March 3, 2010 5:42:38 PM Subject: [VOTE] merge lucene/solr development Many Lucene/Solr committers think that merging development would be a benefit to both projects. Separate downloads would remain (among other things), so end users would not be impacted (except for higher quality products over time). Since this is a change to Lucene/Solr project development, I'd like to get a formal vote from the committers of both projects. If there are 3 +1s and more +1s than -1s, we can pass this to the Lucene PMC to ratify. -Yonik Discussion thread: http://search.lucidimagination.com/search/document/c7817932400808ad/factor_out_a_standalone_shared_analysis_package_for_nutch_solr_lucene
Re: [VOTE] merge lucene/solr development
- Original Message From: Uwe Schindler u...@thetaphi.de To: general@lucene.apache.org Sent: Thu, March 4, 2010 11:19:47 AM Subject: RE: [VOTE] merge lucene/solr development If we vote on what Mike says, I revise my vote and simply vote +/-0 to not stop progress. I have some problem with the construct but in general I am fine with merging dev lists, splitting into modules, merged committers - but not the requirement that tests always pass. In my opinion, if anything changed in Lucene breaks some tests, we could open an issue in Solr. I think that's what's being proposed. With this proposal people wearing Solr dev hats will know they need to fix Solr sooner than they know now - Hudson will tell them on a regular basis, even if you don't spot the Solr test failing, or even if you spot it but don't enter it into JIRA, because you know Hudson will tell the Solr guys something in Lucene trunk changed very recently and broke Solr. Guys, is this interpretation correct? One idea: If we really make Solr depend on the new Lucene lib, Solr should not have Lucene jars in its lib folder, but instead the nightly build should fetch the jars from the Lucene Hudson build. For committers working in svn, maybe some relation to rev numbers (like we do for Lucene backwards tests) can be put into Solr's common-build.xml so the ant script of Solr can check out the correct Lucene rev and build it on the fly. I was wondering the same thing. That way svn repos don't need to be reorganized. Or maybe there is some svn repo linking trickery that's possible. Otis We are voting on this: * Merging the dev lists into a single list. * Merging committers.
* When any change is committed (to a module that belongs to Solr or to Lucene), all tests must pass * Release both at once (but the specific logistics are still up for discussion) * Modularize the sources: pull things out of Lucene's core (break out the query parser, move all core queries/analyzers under their contrib counterparts), pull things out of Solr's core (analyzers, queries) These things would not change: * Besides modularizing (above), the source code would remain factored into separate dirs/modules the way it is now. * Issue tracking remains separate (SOLR-XXX and LUCENE-XXX issues) * Users' lists remain separate. * Web sites remain separate. * Release artifacts/jars remain separate I am fine with fixing bugs in Solr that are there before the change but only appear because of the change. OK My problem is more such things like the per-segment mega-problem, where Solr was simply using Lucene incorrectly (I hope this is not said too hard). You know, Lucene also used those APIs incorrectly until we cut over Lucene's search to be per-segment ;) We got lucky in that the APIs were at best ambiguous about whether the incoming reader was per-segment or not. We did not break backwards. Right, and so Solr's tests should have passed. But if we had had to repair the whole of Solr (which is still not finished) after the per-segment change, we would still not have per-segment search. But you won't have to fix Solr from such a change. Others (people wearing Solr hats) will. And fixing this is surely not easily possible for non-solrcore developers like me or you. Right.
So even if development goes together, we should still have the possibility to update Lucene, and if it's not a backwards break but incorrect usage of Lucene's API (or assumptions about the behavior of Lucene's API that are neither documented nor obvious from the API design - like Filters never being meant to work on top-level searchers and *only* using the passed-in IndexReader), I would simply break Solr and let the Solr devs fix it in a separate issue. There would no longer be Solr devs -- just devs who sometimes wear Solr hats, sometimes wear Lucene hats, sometimes both, at different times. Uwe, this is in fact the proposal -- you can break Solr (but you must pass its tests), and devs with Solr hats will fix it. It's a separate issue. Mike
NYC Search in the Cloud meetup: Jan 20
Hello, If Search Engine Integration, Deployment and Scaling in the Cloud sounds interesting to you, and you are going to be in or near New York next Wednesday (Jan 20) evening: http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/ Sorry for dupes to those of you subscribed to multiple @lucene lists. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
Re: [VOTE] Graduate Lucene.Net as a subproject under Apache Lucene
+1 Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: George Aroush geo...@aroush.net To: general@lucene.apache.org Sent: Thu, October 8, 2009 6:04:09 PM Subject: [VOTE] Graduate Lucene.Net as a subproject under Apache Lucene Hi Folks, On behalf of the Lucene.Net mentor, committers and community, this is a call for a vote to graduate the Lucene.Net project (http://incubator.apache.org/lucene.net/) as a sub-project under Apache Lucene. The Lucene.Net mentor, committers, and the community have voted like so: +1 from Erik Hatcher (mentor) +1 from George Aroush (committer) +1 from Isik YIGIT (aka: DIGY) (committer) +1 from Doug Sale (committer) +1 from a total of 70+ Lucene.Net members / followers / users. (with no -1 or 0 votes) The vote result can be found here: http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200909.mb ox/%3c166a01ca3739$13947380$3abd5a...@net%3e The rationale for graduation is: * Lucene.Net has been under incubation since April 2006 (3 1/2 years now). * During incubation, Lucene.Net has: - Made, 1 official release (Incubating-Apache-Lucene.Net-2.0-004-11Mar07). - Released, as SVN tag, 18 ports of Java Lucene (from 1.9 to 2.4.0). - Released, as SVN tag, port of WordNet.Net 2.0, SpellChecker.Net 2.0, Snowball.Net 2.0, and Highlighter.Net 2.0. - Released, MSDN style documentation for the above release. - Accepted, two new committers: Isik YIGIT (DIGY) digydigy @ gmail.com and Doug Sale dsale @ myspace-inc.com were added in November 2008 (George Aroush george @ aroush.net is the original committer). - The community has grown, with a healthy following. - Is being used by well established companies in production (I'm not sure what's the legality to mention their names here, or even if I have the complete list). - Is being used by the Beagle project.
* Work is already under way to port Java Lucene 2.9 to Lucene.Net 2.9 If this graduation is approved, Lucene.Net will be officially called Apache Lucene.Net Please cast your votes: [ ] +1 Graduate Lucene.Net as a sub-project under Apache Lucene. [ ] -1 Lucene.Net is not ready to graduate as a sub-project under Apache Lucene, because ... This vote will close on October 14th, 2009. Regards, -- George Aroush
Re: Index Ratio
Hi Brett, Try creating a simple MS Word document with just a single character in it. Save it as .doc and check the size. Export to PDF and check the size. I don't know exactly how big those docs will be, but I bet they'll be many, many times larger than that one-byte character. Open up your index with Luke to see what's in it. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: pof melbournebeerba...@gmail.com To: general@lucene.apache.org Sent: Wednesday, June 24, 2009 8:47:39 PM Subject: Index Ratio Hi, I just completed a batch test index of ~1100 documents of various file types and I noticed that the original documents take up about 145MB but my index is only 1.7MB?? I remember reading somewhere that the typical compression rate is about 20-30% or something, but mine is a little over 1%! I'm not complaining or anything. It just struck me as odd, especially as I have a lot of archive files and emails with attachments that I parse as well. Has anyone else experienced something like this, I'm just curious. Cheers. Brett. -- View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24195272.html Sent from the Lucene - General mailing list archive at Nabble.com.
Re: [ORP] JIRA
And 2 brand new mailing lists you can subscribe to: openrelevance-user-subscr...@lucene.apache.org openrelevance-dev-subscr...@lucene.apache.org Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Grant Ingersoll gsing...@apache.org To: general@lucene.apache.org Sent: Wednesday, June 24, 2009 5:27:17 PM Subject: [ORP] JIRA The Open Relevance Project now has JIRA setup: https://issues.apache.org/jira/secure/project/ViewProject.jspa?pid=12310943
Re: Using Lucene to index OSM nodes (400M latitude/longitude points)
Hi Kelly, I think you want to look at LocalLucene (or LocalSolr). I haven't played with Local*, so I can't provide more than this tip. Actually, I can also suggest dumping Plucene - it's a dead project, and even when it was alive it was quite slow. If you really need to be able to search from a Perl application, your best bet may be using a Perl Solr client. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kelly Jones kelly.terry.jo...@gmail.com To: general@lucene.apache.org Sent: Tuesday, June 23, 2009 11:52:48 PM Subject: Using Lucene to index OSM nodes (400M latitude/longitude points) Can Lucene index the openstreetmap.org (OSM) node db (400M latitude/longitude pairs), and then find the 20 nodes closest to a given latitude/longitude? More specifically: % Can Lucene index numerical data and understand that 16 is close to 15, but far away from 16? % Is Lucene reasonably fast indexing 400M floating point pairs? % After Lucene creates the 400M index, can it return search results reasonably fast? % Is there a guide/tutorial that shows how to use Lucene to index numerical data (I'm using Plucene, but I'll settle for any sort of guide)? I tried to index OSM data w/ SQLite3, but it took forever. I realize I could use MySQL/PostgreSQL, but I'm looking for an embedded/serverless solution. -- We're just a Bunch Of Regular Guys, a collective group that's trying to understand and assimilate technology. We feel that resistance to new ideas and technology is unwise and ultimately futile.
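[Editor's note: the underlying task here - find the k nodes nearest a point - can be sketched independently of Lucene. Below is a minimal pure-Python illustration (not Lucene or LocalLucene code; the node layout and function names are invented) of brute-force nearest-k using the haversine great-circle distance.]

```python
import heapq
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # Earth radius ~6371 km

def nearest(nodes, lat, lon, k=20):
    """Return the k nodes closest to (lat, lon); nodes is [(id, lat, lon), ...]."""
    return heapq.nsmallest(k, nodes, key=lambda n: haversine_km(lat, lon, n[1], n[2]))

nodes = [(1, 48.8566, 2.3522),    # Paris
         (2, 51.5074, -0.1278),   # London
         (3, 40.7128, -74.0060)]  # New York
print(nearest(nodes, 50.0, 1.0, k=2))  # the two European nodes come first
```

Of course, scanning all 400M nodes per query is exactly what a spatial index avoids - LocalLucene-style approaches first narrow candidates with a bounding box or geohash-like field and only then rank by exact distance.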
Re: [ORP] Fwd: Confluence email diffs
Excellent, thanks for figuring this out. That link to diffs either wasn't there before or I never noticed it until now. But that works for me! Otis - Original Message From: Grant Ingersoll gsing...@apache.org To: general@lucene.apache.org Sent: Tuesday, June 9, 2009 11:06:42 AM Subject: [ORP] Fwd: Confluence email diffs FYI on Confluence email diffs... Begin forwarded message: From: Wendy Smoak Date: June 5, 2009 3:38:37 PM EDT To: Apache Infrastructure Subject: Re: Confluence email diffs On Fri, Jun 5, 2009 at 9:18 AM, Grant Ingersoll wrote: Does anyone know how to have Confluence email the diffs when a page is edited instead of the whole page? Right now it sends in a link to the diffs plus the whole page. I'd rather have it just mail the diffs. Last time I asked, it wasn't able to send diffs like Moin Moin used to. This is open... http://jira.atlassian.com/browse/CONF-15252 Show unix-style diffs in Text notification emails ... but it does mention Html-based watch notification emails have recently been updated to include diffs between the old and new page content.
Re: [VOTE] Make the Open Relevance Project (ORP) an official Lucene subproject
+1 Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Grant Ingersoll gsing...@apache.org To: general@lucene.apache.org Sent: Thursday, May 28, 2009 7:26:35 AM Subject: [VOTE] Make the Open Relevance Project (ORP) an official Lucene subproject I'd like to call a vote on adding the ORP as an official Lucene subproject per the proposal at http://wiki.apache.org/lucene-java/OpenRelevance with the committers specified on the Wiki page. [] +1 - Yes, I love it [] 0 - I don't care [] -1 - I don't love it Thanks, Grant
Re: RAM or File?
Yes. I remember having a very hard time showing that RAMDirectory is faster than FSDirectory back in 2004 while writing Lucene in Action No. 1. If you run the unit test that's supposed to show it, I think you'll see this. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Ted Dunning ted.dunn...@gmail.com To: general@lucene.apache.org Sent: Tuesday, May 26, 2009 5:01:47 PM Subject: RAM or File? What is the current received wisdom regarding the use of a ram-based or file-based retrieval? Will file-based retrieval match ram based speed after sufficient warmup and assuming that the java memory footprint is kept small to allow maximal OS caching? -- Ted Dunning, CTO DeepDyve
benchmark contrib, wikipedia, publishing results
Been thinking about ORP on and off all day today... and Mark brought up the benchmark contrib. Shouldn't we publish Lucene results for that somewhere on the site? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Open Relevance Project?
Not sure if this was mentioned before, but hm, I was going to point out http://index.isc.org/ (see http://ioiblog.wordpress.com/2008/11/07/kicking-off-the-ioi-blog/ ), but the server doesn't seem to be listening... aha, here: http://ioiblog.wordpress.com/2009/02/ Perhaps we can get data from Dennis and Jeremie? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Ted Dunning ted.dunn...@gmail.com To: general@lucene.apache.org Sent: Wednesday, May 13, 2009 2:48:43 PM Subject: Re: Open Relevance Project? Crawling a reference dataset requires essentially one-time bandwidth. Also, it is possible to download, say, Wikipedia in a single go. Likewise there are various web crawls that are available for research purposes (I think). See http://webascorpus.org/ for one example. These would be single downloads. I don't entirely see the point of redoing the spidering. On Wed, May 13, 2009 at 10:56 AM, Grant Ingersoll wrote: Good point, although you never know. We also will have some bandwidth reqs for crawling. -- Ted Dunning, CTO DeepDyve
Re: Allow committers from any subproject to edit TLP site
+1 Otis - Original Message From: Grant Ingersoll gsing...@apache.org To: general@lucene.apache.org Sent: Saturday, March 21, 2009 1:04:05 PM Subject: Allow committers from any subproject to edit TLP site What do people think of allowing any subproject committer the right to edit the TLP site? I think it would make it easier for people to add news to the TLP and maybe help keep it a little fresher. -Grant
Re: PyLucene news
And now we are almost running out of space for those Lucene subproject tabs! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Michael McCandless luc...@mikemccandless.com To: general@lucene.apache.org Sent: Saturday, January 24, 2009 6:20:45 AM Subject: Re: PyLucene news Welcome to Apache, PyLucene! Mike Andi Vajda wrote: I'm pleased to announce that the PyLucene subproject now has its web site and mailing lists live: - http://lucene.apache.org/pylucene/ - http://lucene.apache.org/pylucene/resources/mailing_lists.html Please use the new pylucene-...@lucene.apache.org for discussings pertaining to this subproject. Thanks ! Andi.. ps: the JIRA project remains to be setup (INFRA-1861).
Re: Welcome PyLucene
Welcome! Do we need a new PyLucene tab now? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Andi Vajda va...@osafoundation.org To: general@lucene.apache.org Sent: Thursday, January 8, 2009 10:51:51 PM Subject: Re: Welcome PyLucene On Thu, 8 Jan 2009, Grant Ingersoll wrote: The Lucene PMC is pleased to announce the arrival of PyLucene as a Lucene subproject. PyLucene is a Python-based port of Lucene by Andi Vajda that was hosted at the Open Source Applications Foundation. It is automatically generated from the Lucene Java sources. Initial committers on the project are Andi Vajda and Michael McCandless, both of whom are Lucene Java committers. We are in the process of checking in the code and getting the site setup, so please bear with us as we do. PyLucene will live in SVN at http://svn.apache.org/repos/asf/lucene/pylucene/ If you wish to help, keep an eye on the Lucene website and this mailing list. We will follow up with information about PyLucene mailing lists, etc. in those locations. Welcome PyLucene! I'm very honored that PyLucene has a new home under the Apache Lucene project. Many thanks to Grant for making it happen ! Andi..
Re: Synchronization and merging indexes
Logan, My guess is you'll get more help if you post your question to the Lucene.Net mailing list (whose address I don't recall off the top of my head). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: chaiguy1337 lo...@electricstorm.com To: general@lucene.apache.org Sent: Saturday, December 20, 2008 7:12:24 PM Subject: Synchronization and merging indexes Hi. I'm currently using Lucene.Net as the backing store for a client Windows app and it's working great, however I'm now looking at making this an occasionally-connected remote-synchronized store. In other words, I want to use one of the free online storage APIs out there that my users can subscribe to and provide login credentials, and use it to back up entire copies of the index (we're talking relatively small indexes here). The scenario should allow for multiple clients to be simultaneously modifying their local copies of the index, and therefore I will need to merge the indexes to allow for multiple sources of change. My question is first of all if anyone has any experience with this, just for some advice, but in particular I'm concerned with the merging process--does merging two indexes simply concatenate all documents in each, even if they are identical, or is there some kind of logic performed to union duplicates? If not, how should I go about doing that manually in an efficient way? I'm not terribly worried about conflicts or collisions--in the worst case I can simply duplicate the document, but I don't want duplicate copies of documents created when there is no conflict. Thanks for any advice. Logan -- View this message in context: http://www.nabble.com/Synchronization-and-merging-indexes-tp21110690p21110690.html Sent from the Lucene - General mailing list archive at Nabble.com.
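[Editor's note: on the merging question itself - Lucene's index merging (e.g. IndexWriter.addIndexes) concatenates documents and performs no duplicate detection, so dedup-by-key has to happen in the application. A language-neutral sketch in Python (the `id`/`modified` fields and last-writer-wins policy are invented for illustration, not part of any Lucene.Net API):]

```python
def merge_indexes(local_docs, remote_docs):
    """Merge two document collections, keeping one copy per unique id.
    On conflict (same id in both), the doc with the later 'modified' wins."""
    merged = {}
    for doc in local_docs + remote_docs:
        existing = merged.get(doc["id"])
        if existing is None or doc["modified"] > existing["modified"]:
            merged[doc["id"]] = doc
    return list(merged.values())

local = [{"id": "a", "modified": 1, "body": "draft"}]
remote = [{"id": "a", "modified": 2, "body": "edited"},
          {"id": "b", "modified": 1, "body": "new"}]
print(merge_indexes(local, remote))  # one copy of "a" (the edited one), plus "b"
```

In a real index, the equivalent move is to delete by the unique-key term before re-adding each incoming document, so the merged index never holds two copies of the same logical document.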
Re: Local Lucene and Local Solr
1. sounds like the right choice to me. On the topic of committing early, would committing it and allowing people to svn up/co, build locally, and implement the missing pieces not get us faster to the point of being able to release it? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Grant Ingersoll [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Monday, August 25, 2008 11:41:10 AM Subject: Local Lucene and Local Solr The creators of Local Lucene and Local Solr (http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene.htm ) have generously agreed to donate the code to Lucene. The Lucene PMC is working through the details of the software grant. The one remaining road block, potentially, is that there is still some LGPL code involved that needs to be replaced. We could commit this before removing it, as long as we don't release it. So, if there are volunteers willing to do the work, I'd be more inclined to move forward w/ finishing out the grant and committing it. In the meantime, I would like to open the discussion of where this should live in Lucene. The options are: 1. Split them up and make them each a part of Lucene and Solr and let the committers of those projects decide where things go 2. Create a separate Geo search subproject under Lucene TLP with its own set of committers, etc. just like any of the other sub projects (Solr, Tika, Java, etc.) This requires the PMC to vote to create a new subproject. 3. Other? So, what do people think? Where would you like to see Local Search live w.r.t. Lucene and Solr? -Grant
Re: Lucene is not able to index certain words of txt file converted from pdf
Hi, Use the java-user list, there are more people on it. You need to change the setting in IndexWriter that tells Lucene how many tokens from a document to index. By default it indexes only 10,000. I can't remember the parameter name, but look at the IndexWriter javadocs, it's right there. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: m657m [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Wednesday, June 18, 2008 8:24:53 AM Subject: Lucene is not able to index certain words of txt file converted from pdf Hi I am using Lucene for indexing and searching documents. I have a PDF (Lucene_in_action.pdf) file which I converted to a txt file using PDFBox. I indexed the same txt file, but while searching it's not able to find certain words. But Lucene has given me results if I search for other words. I am not able to find any reason for that. If any of you intellectuals can help me out in finding the reason. Thanks in advance. -- View this message in context: http://www.nabble.com/Lucene-is-not-able-to-index-certain-words-of-txt-file-converted-form-pdf-tp17981585p17981585.html Sent from the Lucene - General mailing list archive at Nabble.com.
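[Editor's note: the setting Otis means is IndexWriter's maximum field length, which historically defaulted to 10,000 tokens. A toy sketch - in Python rather than Java, with an invented `index_tokens` helper - of why words past the cap silently become unfindable:]

```python
DEFAULT_MAX_FIELD_LENGTH = 10_000  # Lucene's historical per-field token cap

def index_tokens(text, max_field_length=DEFAULT_MAX_FIELD_LENGTH):
    """Toy indexer: only the first max_field_length tokens become searchable,
    mirroring how IndexWriter truncates long fields at the cap."""
    return set(text.lower().split()[:max_field_length])

# A 20,000-token "document": everything after token 10,000 is dropped.
doc = " ".join(f"word{i}" for i in range(20_000))
indexed = index_tokens(doc)
print("word500" in indexed)    # True  - within the first 10,000 tokens
print("word15000" in indexed)  # False - past the cap, silently not indexed
```

This matches the reported symptom exactly: words from early pages of the converted PDF are searchable, while words that only appear later in the (long) text file are not.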
Re: Wildcard Search over multiple fields
Hello, Wildcard queries are inefficient in general. But it sounds like you simply want to combine them into a BooleanQuery where each clause is a SHOULD clause. A better place to ask is the java-user list. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: jm85 [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Wednesday, May 7, 2008 7:14:03 AM Subject: Wildcard Search over multiple fields Hello, What is the best method of performing a leading and trailing wildcard search over multiple fields? Currently I'm performing a wildcard search on one field at a time, but this is potentially inefficient: WildcardQuery wildCardQuery = new WildcardQuery(new Term(searchField, "*" + searchText + "*")); Thanks for your help, James Murphy -- View this message in context: http://www.nabble.com/Wildcard-Search-over-multiple-fields-tp17101839p17101839.html Sent from the Lucene - General mailing list archive at Nabble.com.
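[Editor's note: Otis's suggestion - one query whose per-field wildcard clauses are OR'd together as SHOULD clauses - can be illustrated outside Lucene. A pure-Python stand-in, where fnmatch plays the role of the wildcard matcher and the document shape is invented:]

```python
from fnmatch import fnmatch

def wildcard_any_field(doc, fields, pattern):
    """OR-combine a leading/trailing wildcard match across several fields,
    like a BooleanQuery whose clauses are all SHOULD: a doc matches if any
    one of its fields matches the pattern."""
    return any(fnmatch(doc.get(f, "").lower(), pattern.lower()) for f in fields)

docs = [{"title": "Lucene in Action", "body": "full-text search"},
        {"title": "Cooking 101", "body": "recipes"}]
hits = [d for d in docs if wildcard_any_field(d, ["title", "body"], "*search*")]
print(len(hits))  # 1
```

The point of the single combined query is that the engine makes one pass over the candidates instead of running a separate wildcard query per field and merging result lists afterwards.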
Re: Improving indexing and some questions
Marko, You are not getting any responses here because this general@ list is pretty empty. Please email the java-user list. I mentioned this in my previous reply, but for some reason you didn't go for it. Please see http://wiki.apache.org/lucene-java/HowToContribute Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Marko Novakovic [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Monday, March 24, 2008 8:17:40 PM Subject: Improving indexing and some questions Dear all, I have ideas for improving indexing for web search. I have written a tutorial for the IPSI conference in Opatija about ranking in search engines: The New Avenues in Web Search. My article will be published in IPSI Magazine by October 2008. This tutorial and my ideas were inspired by articles from IEEE Computer Magazine, August 2007 issue. I wrote about individual, collaborative, sponsored and mobile search, and social aspects of the Web. The main idea is to implement indexing based on a relational database. This database would hold records about users, physical and logical communities (like some enterprise, country, autonomous system, provider, etc.), queries, and users' clicks. A service that tracks and analyzes user behaviour would also be involved. Indexing would be dynamically propagated by the user's recent behavior (clicks for the same or a similar query). Ranking would be implemented by a support vector machine, which would give the relevance of each query for each user. This algorithm is described in the article: T. Joachims, F. Radlinski: Search Engines that Learn from Implicit Feedback, IEEE Computer, August 2007, pp. 38. Community indexing would be implemented by making relevant promotions, which is described in the article: B. Smyth: A Community-Based Approach to Personalizing Web Search, IEEE Computer, August 2007, pp. 45-46. I also deliberate on some concepts which could be implemented for indexing in sponsored and mobile search and the social web.
I would be honoured to get feedback from Apache's staff. Best regards __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
Re: how to control the disk size of the indices
Hi Yannis, I don't think there is anything of that sort in Lucene, but this shouldn't be hard to do with a process outside Lucene. Of course, optimizing an index increases its size temporarily, so your external process would have to take that into account and play it safe. You could also set mergeFactor to 1, which should keep your index in a fully optimized state if you don't do any deletions, and a near-optimized state if you do deletions. You should discuss this on the java-user list, though, so I'm CCing that list, where you can continue the discussion. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Yannis Pavlidis [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Monday, March 24, 2008 7:33:26 PM Subject: how to control the disk size of the indices Hi all, I wanted to ask the list whether there is an easy and efficient way to manage the size (in bytes) of a Lucene index stored on disk. Basically I would like to limit Lucene to storing only 100 GB of information. When Lucene reaches that limit I would delete documents (using an LRU algorithm based on timestamps), but in no case should the disk space occupied by Lucene exceed 100 GB. I experimented with Lucene 2.3.1 and the only way I could accomplish that was by calling the optimize method (after the index size exceeded the max size) on the IndexWriter. I was looking for a more performant way to perhaps control when Lucene merges segments so as not to exceed the pre-set limit. Any ideas or suggestions would be highly appreciated. Thanks in advance, Yannis.
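The external LRU process Otis suggests can be sketched in plain Java. Everything below is hypothetical (no Lucene API involved): given document ids with their sizes and last-used timestamps, it picks the oldest entries to drop until the total fits under the cap; a real process would then issue those deletes through Lucene and should leave headroom for optimize()'s temporary growth.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class IndexSizeGuard {
    // One tracked document: id, size on disk, and last-used timestamp.
    public static class Entry {
        final String id;
        final long bytes;
        final long lastUsed;
        public Entry(String id, long bytes, long lastUsed) {
            this.id = id; this.bytes = bytes; this.lastUsed = lastUsed;
        }
    }

    // Return the ids to delete, oldest first, so the remaining total
    // is at most maxBytes (an LRU eviction over timestamps).
    public static List<String> evictLru(List<Entry> entries, long maxBytes) {
        List<Entry> byAge = new ArrayList<>(entries);
        byAge.sort(Comparator.comparingLong((Entry e) -> e.lastUsed));
        long total = 0;
        for (Entry e : entries) total += e.bytes;
        List<String> toDelete = new ArrayList<>();
        for (Entry e : byAge) {
            if (total <= maxBytes) break;
            toDelete.add(e.id);
            total -= e.bytes;
        }
        return toDelete;
    }
}
```

Running this periodically from a cron-style job, outside the indexing process, keeps the size policy decoupled from Lucene's own merge behavior.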
Re: Google Summer of Code
Bok Marko, Very interested. I suggest you continue the discussion on [EMAIL PROTECTED], though (CC-ing). You should note that there are several efforts around distributed Lucene. There is SOLR-303 for distributed search, and there is some work in progress in Hadoop land around distributed indexing in a Hadoop cluster. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Marko Novakovic [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Wednesday, March 19, 2008 8:02:29 PM Subject: Google Summer of Code Dear all, I have an idea to implement a distributed version of Lucene for Google Summer of Code. A distributed version would improve the speed of ranking. I also have an idea to implement ranking criteria based on users' behavior and on communities. If you are interested in this, I will describe all the details of my idea. Greetings, Marko Novakovic Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
Re: Lucene indexes in memory
Deepa, You probably want to ask on the [EMAIL PROTECTED] list. Lucene reads in the whole .tii index file (see the Lucene index file formats documentation for explanations of the various index files). It doesn't read in *all* the index files, as those could be quite big. You *can* read your index into a RAMDirectory via FSDirectory, though, and it sounds like that is what you are after. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Deepa Paranjpe [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Wednesday, February 14, 2007 2:32:53 PM Subject: Lucene indexes in memory Hi all, I want to understand how Lucene searches its index -- does it load the whole index into memory at once? Is there any way to make sure that it does so? I want to maximally optimize the search time required by Lucene over ~7M short documents. The queries that I deal with are 6 to 7 tokens on average. Your help on this will be appreciated. -Deepa
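The RAMDirectory route Otis describes looked roughly like this with the Lucene API of that era. This is a sketch only: it needs lucene-core on the classpath, and the index path is illustrative.

```java
// Load an on-disk index entirely into memory, then search it.
// Sketch against the Lucene 1.x/2.x-era API; not runnable without lucene-core.
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.search.IndexSearcher;

Directory onDisk = FSDirectory.getDirectory("/path/to/index");
Directory inMemory = new RAMDirectory(onDisk);   // copies every index file into RAM
IndexSearcher searcher = new IndexSearcher(inMemory);
```

The copy happens once at startup, so for ~7M short documents the main cost is the initial load plus holding the full index in heap.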
Re: [Fwd: [PROPOSAL] index server project]
That's distributed indexing, built on top of Sun Grid. The project won a $50K prize. - Original Message From: Alexandru Popescu [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Thursday, October 19, 2006 10:19:00 AM Subject: Re: [Fwd: [PROPOSAL] index server project] I am not sure this is (somehow) related, but I think I have noticed some project in a Sun contest (it was the big prize winner). I cannot retrieve it now, but hopefully somebody else will. ./alex -- .w( the_mindstorm )p. On 10/19/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Doug, we discussed the need for such a tool several times internally and developed some workarounds for Nutch, so I would definitely be interested in contributing to such a project. Having a separate project that depends on Hadoop would be the best case for our use cases. Best, Stefan On 18.10.2006 at 23:35, Doug Cutting wrote: FYI, I just pitched a new project you might be interested in on [EMAIL PROTECTED] Dunno if you subscribe to that list, so I'm spamming you. If it sounds interesting, please reply there. My management at Y! is interested in this, so I'm 'in'. Doug Original Message Subject: [PROPOSAL] index server project Date: Wed, 18 Oct 2006 14:17:30 -0700 From: Doug Cutting [EMAIL PROTECTED] Reply-To: general@lucene.apache.org To: general@lucene.apache.org It seems that Nutch and Solr would benefit from a shared index serving infrastructure. Other Lucene-based projects might also benefit from this. So perhaps we should start a new project to build such a thing. This could start either in java/contrib, or as a separate sub-project, depending on interest. Here are some quick ideas about how this might work. An RPC mechanism would be used to communicate between nodes (probably Hadoop's). The system would be configured with a single master node that keeps track of where indexes are located, and a number of slave nodes that would maintain, search and replicate indexes.
Clients would talk to the master to find out which indexes to search or update, then they'll talk directly to slaves to perform searches and updates. Following is an outline of how this might look. We assume that, within an index, a file with a given name is written only once. Index versions are sets of files, and a new version of an index is likely to share most files with the prior version. Versions are numbered. An index server should keep old versions of each index for a while, not immediately removing old files.

public class IndexVersion {
  String id;       // unique name of the index
  int version;     // the version of the index
}

public class IndexLocation {
  IndexVersion indexVersion;
  InetSocketAddress location;
}

public interface ClientToMasterProtocol {
  IndexLocation[] getSearchableIndexes();
  IndexLocation getUpdateableIndex(String id);
}

public interface ClientToSlaveProtocol {
  // normal update
  void addDocument(String index, Document doc);
  int[] removeDocuments(String index, Term term);
  void commitVersion(String index);
  // batch update
  void addIndex(String index, IndexLocation indexToAdd);
  // search
  SearchResults search(IndexVersion i, Query query, Sort sort, int n);
}

public interface SlaveToMasterProtocol {
  // sends currently searchable indexes
  // receives updated indexes that we should replicate/update
  public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
}

public interface SlaveToSlaveProtocol {
  String[] getFileSet(IndexVersion indexVersion);
  byte[] getFileContent(IndexVersion indexVersion, String file);
  // based on experience in Hadoop, we probably wouldn't really use
  // RPC to send file content, but rather HTTP.
}

The master thus maintains the set of indexes that are available for search, keeps track of which slave should handle changes to an index, and initiates index synchronization between slaves. The master can be configured to replicate indexes a specified number of times.
The client library can cache the current set of searchable indexes and periodically refresh it. Searches are broadcast to one index with each id and return merged results. The client will load-balance both searches and updates. Deletions could be broadcast to all slaves. That would probably be fast enough. Alternately, indexes could be partitioned by a hash of each document's unique id, permitting deletions to be routed to the appropriate slave. Does this make sense? Does it sound like it would be useful to Solr? To Nutch? To others? Who would be interested and able to work on it? Doug ~~~ 101tec Inc. search tech for web 2.1 Menlo Park, California http://www.101tec.com
Re: [Fwd: [PROPOSAL] index server project]
Damn Y! mail shortcut. The link to the project is in my Lucene group: http://www.simpy.com/group/363 Otis - Original Message From: Alexandru Popescu [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Thursday, October 19, 2006 10:19:00 AM Subject: Re: [Fwd: [PROPOSAL] index server project] I am not sure this is (somehow) related, but I think I have noticed some project in a Sun contest (it was the big prize winner). I cannot retrieve it now, but hopefully somebody else will. ./alex -- .w( the_mindstorm )p.
Re: CLucene incubation - call for a mentor
Hi Ben, I can't volunteer, but you may want to check with Garrett Rooney. He stopped work on lucene4c, so he may be interested in helping you with moving CLucene under Apache Lucene. Otis - Original Message From: Ben van Klinken [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Saturday, October 14, 2006 3:20:10 AM Subject: CLucene incubation - call for a mentor Hi, I am one of the developers of CLucene, a C++ port of Lucene. A long while back, CLucene was invited to join the ASF incubation program under Lucene. For various reasons this hasn't happened yet. But CLucene has still been happily progressing, and interest in the project continues to increase - many open source projects (such as ht://dig and Strigi) as well as many companies use CLucene. CLucene would of course do much better if we were part of the big happy family of Lucene and its sub-projects. However, I believe our main obstacle to this is the absence of an ASF mentor. So basically I'm asking this: would Apache Lucene still like to have us? If yes, would anyone be interested, or know of someone interested, in being our mentor? Looking forward to a response, Ben
Re: What are the pros and cons of using the C# version of Lucene as compared to the Java version in a .NET environment?
I have never used Lucene under Windows, but I do know that some quite high profile Internet companies have used the Lucene.Net port and are happy with it. See http://xanga.com Otis - Original Message From: George Carrette [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Tuesday, May 9, 2006 1:00:08 PM Subject: What are the pros and cons of using the C# version of Lucene as compared to the Java version in a .NET environment? If you are developing mainly in the Microsoft .NET Framework 2.0, then it seems that you have 3 choices for running Lucene. What are the pros and cons of each choice? 1. Use the C# code from the Apache incubator project Lucene.Net http://incubator.apache.org/projects/lucene.net.html 2. Use the flagship project Lucene Java http://lucene.apache.org/java/docs/index.html with the Sun Java runtime, and define some web services in an application such as Apache Tomcat that you can call from your other .NET framework code. 3. Use the Lucene Java sources as above, but compile them using the J# compiler, as illustrated here: http://alum.mit.edu/www/gjc/lucene-java-vjc.html I am particularly interested in risks associated with choice #3. Is the Microsoft J# compiler to be trusted? Do the people using the gcj compiler have any experience to guide somebody considering the use of a Java compiler not provided by Sun? This is for a mission-critical application at a high profile internet media company.
Re: What are the pros and cons of using the C# version of Lucene as compared to the Java version in a .NET environment?
I think all of these questions are awaiting a brave soul with an itch. Got an itch? Otis - Original Message From: Raghavendra Prabhu [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Tuesday, May 9, 2006 4:21:54 PM Subject: Re: What are the pros and cons of using the C# version of Lucene as compared to the Java version in a .NET environment? So will the .NET port, in functional terms, offer capabilities equivalent to those of Java? Are there any crucial features which are missing and have not yet been implemented in Lucene.Net (considering the fact that Lucene Java is going to hit 2.0 soon and Lucene.Net is still in the dev phase at 1.9)? Is there any advantage in terms of speed (when you look at the .NET port)? Are there any benchmark comparisons available? Rgds Prabhu On 5/10/06, Monsur Hossain [EMAIL PROTECTED] wrote: As Otis mentioned, we are using Lucene.Net (the recent 1.9 build) and we are quite happy with it. There were some memory leak bugs early on since .NET doesn't have an equivalent of Java's WeakHashMap; but after those were worked out, performance and scalability have been great. The issues we are focused on now are general Lucene issues (such as scaling a large index and reducing indexing times) rather than C# vs. Java issues. I highly recommend it. Monsur Xanga.com -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 09, 2006 2:02 PM To: general@lucene.apache.org Subject: Re: What are the pros and cons of using the C# version of Lucene as compared to the Java version in a .NET environment? I have never used Lucene under Windows, but I do know that some quite high profile Internet companies have used the Lucene.Net port and are happy with it. See http://xanga.com Otis - Original Message From: George Carrette [EMAIL PROTECTED] To: general@lucene.apache.org Sent: Tuesday, May 9, 2006 1:00:08 PM Subject: What are the pros and cons of using the C# version of Lucene as compared to the Java version in a .NET environment?
RE: Binary fields in index
One of the Jakarta Commons ones - jakarta.apache.org/commons/codec/ Otis --- Tricia Williams [EMAIL PROTECTED] wrote: Which library can Base64 be found in? Thanks, Tricia On Mon, 26 Sep 2005, Koji Sekiguchi wrote: You can encode (e.g. base64) the binary data to get a String and store the String. Koji -Original Message- From: Fredrik Andersson [mailto:[EMAIL PROTECTED] Sent: Monday, September 26, 2005 6:31 PM To: general@lucene.apache.org Subject: Binary fields in index Hello Gang! Is there any trick, or undocumented way, to store binary (unindexed, untokenized) data in a Lucene Field? All the Field constructors just deal with Strings. I'm currently using another database to store binary data, but it would be very neat, and more efficient, to store it directly in Lucene. Thanks in advance, Fredrik
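Koji's encode-then-store suggestion is a short round trip either way: with the Jakarta Commons Codec Base64 class, or, on any modern JDK, via java.util.Base64 (used here so the sketch needs no extra jar; storing the resulting String in a Lucene Field is left out):

```java
import java.util.Base64;

// Round-trip binary data through a String so it can be stored in a
// plain text Lucene Field and recovered on retrieval.
public class BinaryFieldCodec {
    public static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }
    public static byte[] decode(String stored) {
        return Base64.getDecoder().decode(stored);
    }
}
```

Base64 inflates the data by about a third, so for large blobs a separate store (as Fredrik is doing) may still be the better trade-off.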
Re: How to install lucene on windows ?
All you really need is: http://apache.oc1.mirrors.redwire.net/jakarta/lucene/binaries/lucene-1.4.3.jar Otis --- Arpit Sharma [EMAIL PROTECTED] wrote: I have installed Tomcat on XP, but when I go to the page http://apache.oc1.mirrors.redwire.net/jakarta/lucene/binaries/ it shows lots of files. Do I need to download all of them? After downloading them, what should I do? Thanks
Re: Problem of indexing pdf files
That's a log4j warning message: one of the PDFBox classes is trying to log something, and you don't have log4j configured appropriately. This is not a Lucene issue, and it's a warning, so you can ignore it if you want. Otis --- tirupathi reddy [EMAIL PROTECTED] wrote: Hello, I am getting the following warning message when I am indexing pdf files with Lucene: log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser). log4j:WARN Please initialize the log4j system properly. This is the code I am using:

if (pdf.exists()) {
  String text = "";
  try {
    PDDocument document = PDDocument.load(pdf);   // load the file
    PDFTextStripper pts = new PDFTextStripper();  // extract the text
    text = pts.getText(document);
    document.close();
  } catch (IOException e) {
    System.out.println("File not found");
  }
  mDocument.add(Field.Text("fulltext", text));
}

thanx, MTREDDY Tirupati Reddy Manyam 24-06-08, Sundugaullee-24, 79110 Freiburg GERMANY. Phone: 00497618811257 cell : 004917624649007
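To make the warning go away rather than ignore it, a minimal log4j 1.x configuration file on the classpath is enough; something like the following (the console appender and pattern are just one common choice):

```properties
# log4j.properties -- put this on the application classpath
log4j.rootLogger=WARN, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n
```

With this in place, PDFBox's logger finds an appender and the two log4j:WARN lines disappear.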
Re: IndexWriter and IndexReader open at the same time
If you have the Lucene book, look at section 2.9 (Concurrency, thread-safety, and locking issues) in Chapter 2 (Indexing), page 59: http://www.lucenebook.com/search?query=concurrency+rules Also, look at Lucene's Bugzilla, where you'll find a contribution that helps with concurrent IndexReader/IndexWriter usage. Otis --- Greg Love [EMAIL PROTECTED] wrote: Hello, I have an application that gets many delete and write requests at the same time. To avoid opening and closing the IndexWriter and IndexReader every time one of them needs to do a write operation, I keep them both open and have a shared lock around them whenever I need to use them for writing. Everything seems to be working in order, but I'm not sure if this is a safe thing to do. Please let me know. Thank you, lavafish
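Greg's setup — long-lived writer and reader with one shared lock around every mutating operation — can be sketched in plain Java. The counter below is a stand-in for the index; in the real application the guarded operations would be the IndexWriter adds and IndexReader deletes:

```java
import java.util.concurrent.locks.ReentrantLock;

// One process-wide lock guards every mutating operation so that
// concurrent add and delete requests never interleave. The "index"
// here is just a document counter, standing in for the real thing.
public class GuardedIndex {
    private final ReentrantLock writeLock = new ReentrantLock();
    private int docCount = 0;

    public void withWriteLock(Runnable op) {
        writeLock.lock();
        try {
            op.run();
        } finally {
            writeLock.unlock();   // always released, even if op throws
        }
    }

    public void addDocument() { withWriteLock(() -> docCount++); }
    public int docCount() { return docCount; }
}
```

The try/finally shape matters: if a write operation throws, the lock is still released, so other request threads don't deadlock.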
Re: indexing FTP or HTTP or Database
For indexing FTP and HTTP servers, see Nutch (a sub-project of Lucene). For indexing a DB you can write some custom JDBC code to pull your data from the DB and index it with Lucene. I imagine a few other people will email suggestions ;) Otis --- Bassem Elsayed [EMAIL PROTECTED] wrote: How can I use Lucene to index and search FTP or HTTP servers or a database? Thank you. Thanks and Best Regards, Bassem Elsayed Saad Software Engineer ICT Department Bibliotheca Alexandrina P.O. Box 138, Chatby Alexandria 21526, Egypt Tel: +(203) 483 , Ext: 1496 Mob: +(2010) 627 2875 Fax: +(203) 482 0405 Email: [EMAIL PROTECTED] Web Site: www.bibalex.org http://www.bibalex.org/
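The custom-JDBC route Otis mentions is essentially: run a SELECT, turn each row into a Lucene Document, add it to an IndexWriter. The core of that is the row-to-document mapping, sketched here with a plain Map standing in for both the java.sql.ResultSet row and the Lucene Document (class and method names are illustrative, not any real API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DbIndexer {
    // Turn one database row (column name -> value) into the field map
    // we would feed into a Lucene Document. Illustrative stand-in: real
    // code would iterate a java.sql.ResultSet and build an
    // org.apache.lucene.document.Document instead.
    public static Map<String, String> rowToFields(Map<String, Object> row) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Object> col : row.entrySet()) {
            fields.put(col.getKey(), String.valueOf(col.getValue()));
        }
        return fields;
    }
}
```

In the real version, the loop over rows and the IndexWriter.addDocument call wrap around this mapping, and column-to-field decisions (stored vs. indexed, tokenized vs. keyword) are made per column.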