Re: Proposal to move/hide the general@lucene list

2013-01-24 Thread Otis Gospodnetic
+1 for getting rid of it.  Doesn't seem to serve any purpose.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Thu, Jan 24, 2013 at 11:19 AM, Smiley, David W. dsmi...@mitre.org wrote:

  The general@lucene.apache.org list is often misused for Lucene or Solr
 help questions that belong on their respective lists.

  I'm okay with the list being discontinued.  If people are not okay with
 that, then I propose modifying the page where people currently discover the
 list so that they aren't likely to use it instead of the proper list.

 http://lucene.apache.org/core/discussion.html#general-discussion-generallucene
 Perhaps simply adding a "NOT for users seeking help with Lucene" message
 in red.  I can see how users in a hurry can look at the existing
 description (without having read the java-user list description first) and
 think that the general list is the right place.

  ~ David



Re: Choosing the right project

2012-11-29 Thread Otis Gospodnetic
Hi,

For a neutral Solr/ES comparison look at http://blog.sematext.com/

Nutch can index into Solr, but I think not into ES yet.

ManifoldCF has a crawler and can index into both Solr and ES, but its crawler 
is not made for large-scale crawling the way Nutch is.
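The boosting-by-external-rating requirement in the question below maps naturally onto Solr's function queries. As a hedged sketch (the `rating` field name and the use of edismax's `bf` parameter are illustrative assumptions, not something stated in this thread), the request could be built like this:

```java
public class BoostQueryExample {

    // Build the query-string portion of a Solr search request that adds
    // log(rating) to each document's score via edismax's bf (additive
    // boost function) parameter.  "rating" is a hypothetical numeric
    // field populated at index time from the external rating source.
    public static String buildQuery(String userQuery, String ratingField) {
        return "q=" + userQuery.replace(' ', '+')
                + "&defType=edismax"
                + "&bf=log(" + ratingField + ")";
    }

    public static void main(String[] args) {
        // prints: q=cell+phones&defType=edismax&bf=log(rating)
        System.out.println(buildQuery("cell phones", "rating"));
    }
}
```

An additive boost function keeps text relevance dominant; edismax's multiplicative boost parameter is the alternative when the rating should scale scores instead.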

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 
Search Analytics - http://sematext.com/search-analytics/index.html





 From: timd timdavi...@msn.com
To: general@lucene.apache.org 
Sent: Friday, November 23, 2012 3:13 AM
Subject: Choosing the right project
 
Hello all,

I am planning to build a general purpose search engine, where rankings for
certain types of online resources (e.g. cell phones) will be boosted based on
an external rating.

I am therefore looking for the right search index software (Solr or
ElasticSearch) that can provide such boosting option and the right crawler
(Nutch or Heritrix) that can extract ratings for certain products from other
sites.

What solution package would you recommend?

I very much appreciate your help on this.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Choosing-the-right-project-tp4021980.html
Sent from the Lucene - General mailing list archive at Nabble.com.




Re: [DISCUSS] Adding ASF comment system to the Lucene websites

2012-07-10 Thread Otis Gospodnetic
Hi,

Looks handy.  What about spam control?

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 



- Original Message -
 From: Steven A Rowe sar...@syr.edu
 To: general@lucene.apache.org general@lucene.apache.org; 
 d...@lucene.apache.org d...@lucene.apache.org
 Cc: 
 Sent: Monday, July 9, 2012 2:38 PM
 Subject: [DISCUSS] Adding ASF comment system to the Lucene websites
 
 I'd like to add the new ASF comment system to the Lucene websites:
 
 https://blogs.apache.org/infra/entry/asf_comments_system_live
 
 Thoughts?
 
 Steve



Re: Licensing questions

2012-05-13 Thread Otis Gospodnetic
Hi Tiffany,

Apache Lucene is free.  There is no corporation behind it.  It is released 
under the Apache License by the Apache Software Foundation.

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 




 From: Tiffany Karl tk...@logixhealth.com
To: gene...@lucene.com gene...@lucene.com 
Cc: general@lucene.apache.org general@lucene.apache.org 
Sent: Thursday, May 10, 2012 11:42 AM
Subject: Licensing questions
 
Hi,

Our company, LogixHealth, is looking to implement different search controls.  We 
will be using the search for different applications, starting with a portal; 
there will be approximately 500 users across multiple applications.  We are 
inquiring about how much it would cost to license Lucene search and which one 
you would recommend.

Much appreciated,
Tiffany


Tiffany Karl

Manager, Product Management
LogixHealth ▪ 8 Oak Park Drive, Bedford, MA 01730
Phone: 781.280.1566
tk...@logixhealth.com
www.logixhealth.com





Confidentiality notice: This communication and any accompanying document(s) 
are confidential and privileged.  They are intended for the sole use of the 
addressee for business pertaining to LogixHealth.  If you received this 
transmission in error, you are advised that any disclosure, copying, 
distribution, or the taking of any action in reliance upon the communication 
is strictly prohibited.








Re: [Announce] Solr 3.5 with RankingAlgorithm 1.3, NRT support

2011-12-28 Thread Otis Gospodnetic
Hi,

Is there a writeup that describes how this compares to NRT support in the 
development version of Solr?


Otis


Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Nagendra Nagarajayya nnagaraja...@transaxtions.com
To: general@lucene.apache.org 
Sent: Tuesday, December 27, 2011 9:32 AM
Subject: [Announce] Solr 3.5 with RankingAlgorithm 1.3, NRT support
 
Hi!

I am very excited to announce the availability of Solr 3.5 with 
RankingAlgorithm 1.3 (NRT support).  Adding 1 million docs in NRT to the 
MBArtists index, with 1 concurrent request thread executing *:*, runs at 
about 5000 docs in 498 ms.  The query performance is about 168K query 
requests at 4.2 ms / request.

RankingAlgorithm 1.3 supports the entire Lucene query syntax, +/- and AND/OR 
boolean queries.  RankingAlgorithm is very fast and allows you to query a 
10M-doc Wikipedia index (complete index) in 50 ms.

You can get more information about NRT performance from here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver3.x

You can download Solr 3.5 with RankingAlgorithm 1.3 from here:
http://solr-ra.tgels.org

Please download and give the new version a try.


Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org




Re: Suggestions or best practices for indexing the logs

2011-10-17 Thread Otis Gospodnetic
Alex,

You could try compressing the content field - that might help a bit.

Otis


Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
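As a sketch of what "compressing the content field" could look like in practice: Lucene 3.x ships CompressionTools for this, which deflates a field's bytes roughly as below (plain java.util.zip, an illustration rather than Lucene's exact code; note it only pays off for content that is actually stored, not indexed-only):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FieldCompression {

    // Compress a log statement's text before storing it, roughly what
    // Lucene 3.x's CompressionTools.compressString() does internally.
    public static byte[] compress(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse of compress(); what you'd call when reading the field back.
    public static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
        inflater.end();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Repetitive log lines compress very well, which is why this tends to help log-style content in particular.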



From: Alex Shneyderman a.shneyder...@gmail.com
To: general@lucene.apache.org
Sent: Thursday, October 13, 2011 7:21 PM
Subject: Suggestions or best practices for indexing the logs

Hello, everybody!

I am trying to introduce faster searches to our application that sifts
through logs, and Lucene seems to be the tool to use here.  The one
peculiarity of the problem is that there are few files and they
contain many log statements.  I avoid storing the text in the index
itself.  Given all this, I set up indexing as follows:

I iterate over a log file, and for each statement in the log file I
index the statement's content.

Here is the Java code that adds the fields:

            // start/end byte offsets of the statement in the log file
            NumericField startOffset = new NumericField("so",
                Field.Store.YES, false);
            startOffset.setLongValue(statement.getStartOffset());
            doc.add(startOffset);

            NumericField endOffset = new NumericField("eo",
                Field.Store.YES, false);
            endOffset.setLongValue(statement.getEndOffset());
            doc.add(endOffset);

            // statement timestamp, indexed so it can be range-queried
            NumericField timestampField = new NumericField("ts",
                Field.Store.YES, true);
            timestampField.setLongValue(statement.getStatementTime().getTime());
            doc.add(timestampField);

            // file tag is stored only; content is indexed but not stored
            doc.add(new Field("fn", fileTagName, Field.Store.YES,
                Field.Index.NO));
            doc.add(new Field("ct", statement.getContent(),
                Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO));

I am getting following results (index size vs log files) with this scheme:

The size of the logs is 385MB.
(00:13:08) /var/tmp/logs  du -ms /var/tmp/logs
385     /var/tmp/logs


The size of the index is 143MB.
(00:41:26) /var/tmp/index  du -ms /var/tmp/index
143     /var/tmp/index

Is this ratio (143 MB / 385 MB) normal?  It seems a bit too much - I would
expect something like 1/5 - 1/7 for the index.  Is there anything I can do
to move this toward the desired ratio?  Of course, the word histogram would
help; here is the top of the output of the word-histogram script that I ran
on the logs:

Total number of words: 26935271
Number of different words: 551981
The most common words are:
as      3395203
10      797708
13      797662
2011    795595
at      787365
timer   746790
...

Could anyone suggest a better way to organize the index for my logs?  And
by better I mean more compact.  Or is this as good as it gets?  I tried
to optimize and got a 2 MB improvement (the index went from 145 MB to
143 MB).

Could anyone point to an article that deals with indexing of logs? Any
help, suggestions and pointers are greatly appreciated.

Thanks for any and all help and cheers,
Alex.




Re: Multiple Solr replicaton threads

2011-09-05 Thread Otis Gospodnetic
Ram,

What is x in your case and how much data needs to be replicated each time, 
roughly?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



From: bramsreddy bramsre...@gmail.com
To: general@lucene.apache.org
Sent: Monday, September 5, 2011 1:50 AM
Subject: Multiple Solr replicaton threads

Hi,

I have a master-slave setup.  The slave pulls the index from the master every x
seconds.  The problem is that, for one single replication slot, two threads are
getting created and trying to work in parallel on the same index.  This is
causing a "Lock obtain timed out" exception:

2011-09-05 07:40:00,014 INFO  [org.apache.solr.handler.SnapPuller]
(pool-15-thread-1) Master's version: 1310981586400, generation: 10188
2011-09-05 07:40:00,014 INFO  [org.apache.solr.handler.SnapPuller]
(pool-15-thread-1) Slave's version: 1310981586382, generation: 10170
2011-09-05 07:40:00,014 INFO  [org.apache.solr.handler.SnapPuller]
(pool-15-thread-1) Starting replication process
2011-09-05 07:40:00,016 INFO  [org.apache.solr.handler.SnapPuller]
(pool-19-thread-1) Master's version: 1310981586400, generation: 10188
2011-09-05 07:40:00,016 INFO  [org.apache.solr.handler.SnapPuller]
(pool-19-thread-1) Slave's version: 1310981586393, generation: 10181
2011-09-05 07:40:00,017 INFO  [org.apache.solr.handler.SnapPuller]
(pool-19-thread-1) Starting replication process

How can I make it create a single thread per replication?

Regards
Ram


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-Solr-replicaton-threads-tp3310001p3310001.html
Sent from the Lucene - General mailing list archive at Nabble.com.




Re: CLOSE_WAIT after connecting to multiple shards from a primary shard

2011-05-30 Thread Otis Gospodnetic
Hi,

A few things:
1) why not send this to the Solr list?
2) you talk about searching, but the code sample is about optimizing the index.

3) I don't have the SolrJ API in front of me, but isn't there a 
CommonsHttpSolrServer ctor that takes a URL instead of an HttpClient 
instance?  Try that one.

Otis
-
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Mukunda Madhava mukunda...@gmail.com
 To: general@lucene.apache.org
 Sent: Mon, May 30, 2011 1:54:07 PM
 Subject: CLOSE_WAIT after connecting to multiple shards from a primary shard
 
 Hi,
 We are having a primary Solr shard and multiple secondary shards.  We
 query data from the secondary shards by specifying the shards param in the
 query params.
 
 But we found that, after receiving the data, there are a large number of
 CLOSE_WAIT connections on the secondary shards from the primary shards.
 
 Like for e.g.
 
  tcp    1   0  primaryshardhost:56109  secondaryshardhost1:8090  CLOSE_WAIT
  tcp    1   0  primaryshardhost:51049  secondaryshardhost1:8090  CLOSE_WAIT
  tcp    1   0  primaryshardhost:49537  secondaryshardhost1:8089  CLOSE_WAIT
  tcp    1   0  primaryshardhost:44109  secondaryshardhost2:8090  CLOSE_WAIT
  tcp    1   0  primaryshardhost:32041  secondaryshardhost2:8090  CLOSE_WAIT
  tcp    1   0  primaryshardhost:48533  secondaryshardhost2:8089  CLOSE_WAIT
 
 
 We open the Solr connections as below:
 
  SimpleHttpConnectionManager cm = new SimpleHttpConnectionManager(true);
  cm.closeIdleConnections(0L);
  HttpClient httpClient = new HttpClient(cm);
  solrServer = new CommonsHttpSolrServer(url, httpClient);
  solrServer.optimize();
 
 But still we see these issues. Any ideas?
 -- 
 Thanks,
 Mukunda
 


Re: is query cache persisted?

2011-04-12 Thread Otis Gospodnetic
Hi,

Are you using raw Lucene or Solr?  If Solr, your query is probably cached in 
the 
query results cache (see your solrconfig.xml).

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Yang tedd...@gmail.com
 To: general@lucene.apache.org
 Sent: Tue, April 12, 2011 1:35:19 PM
 Subject: is query cache persisted?
 
 I was trying to trace through the calls in Lucene,
 and when I invoked the same query for the second time, scorer.score()
 is no longer called anymore,
 and the query returns very fast.
 
 this seems to be the case even after I restarted Tomcat, so I'm
 wondering: is the query cache persisted in Lucene?
 if so, how could I purge it?
 
 Thanks a lot
 Yang
 


Re: Number of Boolean Clauses (AND vs OR)

2011-04-11 Thread Otis Gospodnetic
I believe AND will be faster, at least in cases when one of the earlier clauses 
doesn't actually match any docs, in which case the whole query should terminate 
early and not evaluate the remaining clauses.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
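The early-termination argument can be illustrated with the sorted-postings intersection an AND query performs; this is a simplified doc-ID-list model for illustration, not Lucene's actual scorer code:

```java
import java.util.Arrays;

// Why an AND (conjunction) query can terminate early: the intersection
// is bounded by the shortest postings list, and the moment any clause's
// postings are exhausted the whole query stops.  An OR must instead
// visit every posting of every clause.
public class ConjunctionSketch {

    // Intersect two ascending doc-ID lists (toy postings lists).
    public static int[] intersect(int[] a, int[] b) {
        int[] result = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        // stops as soon as EITHER list runs out - for a clause matching
        // zero docs, the loop body never executes at all
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                result[n++] = a[i];
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return Arrays.copyOf(result, n);
    }
}
```

With 20 AND clauses the same logic applies pairwise: the rarest term drives the scan, which is the intuition behind "AND will be faster" above.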



- Original Message 
 From: entdeveloper cameron.develo...@gmail.com
 To: general@lucene.apache.org
 Sent: Mon, April 11, 2011 2:50:28 PM
 Subject: Number of Boolean Clauses (AND vs OR)
 
 Does the type of boolean clause matter for a good number of boolean clauses?
 In other words, if I have a query with 20 boolean clauses, will 20 OR
 clauses perform any faster or slower than 20 AND clauses?
 
 I know these would perform completely different queries, but I am asking on
 a theoretical level. Obviously the total number matters, hence the limit of
 1024 max boolean clauses.
 
 --
 View this message in context: 
http://lucene.472066.n3.nabble.com/Number-of-Boolean-Clauses-AND-vs-OR-tp2807905p2807905.html

 Sent  from the Lucene - General mailing list archive at Nabble.com.
 


Re: Get last search data from SOLR

2011-01-18 Thread Otis Gospodnetic
Jotta,

You may want to ask on solr-user list in the future.

If you are asking whether Solr can tell you what was the last document that 
Solr 
returned to the last query it executed, the answer is no.

Maybe you can describe what you are trying to accomplish, so we can help you.  
Email solr-user though.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: jotta sobcz...@gmail.com
 To: general@lucene.apache.org
 Sent: Mon, January 17, 2011 1:51:13 AM
 Subject: Get last search data from SOLR
 
 
 Hi!
 I have some question about SOLR searching.
 Does SOLR support  getting last searched (by users) resources? (from some
 period of  time)
 
 PS Sorry for my English :)
 
 Regards
 Jotta
 -- 
 View  this message in context: 
http://lucene.472066.n3.nabble.com/Get-last-search-data-from-SOLR-tp2270661p2270661.html

 Sent  from the Lucene - General mailing list archive at Nabble.com.
 


Re: Apache Solr is not available

2010-12-25 Thread Otis Gospodnetic
Hi,

I think you'll get more help if you ask the Drupal community.  That error message 
is 
specific to Drupal.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: nitishgarg nitishgarg1...@gmail.com
 To: general@lucene.apache.org
 Sent: Sat, December 25, 2010 2:39:18 AM
 Subject: Apache Solr is not available
 
 
 I am using Drupal and Apache Solr. I keep getting the error that Apache  Solr
 is not available. Please contact your administrator.
 
 It works  fine when installed newly. But this error crops after some time.
 Any  suggestions?
 -- 
 View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-is-not-available-tp2143311p2143311.html

 Sent  from the Lucene - General mailing list archive at Nabble.com.
 


Re: [PMC] Next Steps on Lucene.NET

2010-12-24 Thread Otis Gospodnetic
Personally, I would be *very* interested in whether moving Lucene.NET to GitHub 
will make a difference in terms of progress and style of development.  Maybe 
forking, pull requests, and the whole social thing makes it easier for people 
to participate.  Since Lucene.NET has struggled for years at ASF, this would be 
a great opportunity to see if the above makes a difference.

My 0.02 NT

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: general@lucene.apache.org
 Sent: Thu, December 23, 2010 10:41:18 AM
 Subject: Re: [PMC] Next Steps on Lucene.NET
 
 
 On Dec 21, 2010, at 12:00 PM, Chris Hostetter wrote:
 
  
  :  point, it's either the Attic or Incubator and I'm leaning toward Attic.  
  : However, I think it makes sense to give one more chance by  saying:  You 
  : have until January 31 to put together a proposal  for going back to the 
  : Incubator.  Please see http://incubator.apache.org for what such a 
  : proposal  entails.
  
   I'm not certain the attic is appropriate -- my understanding is that it's 
   the final resting place for projects (TLP, ie: an entire PMC) that are 
   being dissolved at a foundation level via board resolution.
  
   Within a PMC, like Lucene, the decision to retire specific sub-projects 
   and mailing lists probably doesn't need to require a board resolution.
  
  but i could be wrong.
 
  OK, I'm not sure either.  I will check.  We could certainly just mothball it 
 here, but I don't think that is necessarily what we want either.
 
  
   In either case, having a hard date seems like a good idea -- i thought one 
   had been established before, but i guess not.
  
  The hard date of addressing the 4 issues was set for the end of the year.  I 
 don't think any of them have been addressed.  There was a big discussion for a 
 while, but it doesn't seem like anyone has done any of the actual work, even 
 something as simple as updating the website.  This next date, in my mind, is to 
 make it clear that the Lucene PMC is done being responsible for Lucene.NET by 
 Jan. 31.  I am more than willing to help them move somewhere else, but it is up 
 to them to say where that is.
 
 -Grant


Re: TF & IDF values for a search term

2010-12-24 Thread Otis Gospodnetic
Vikas, look at DefaultSimilarity.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
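For reference, DefaultSimilarity's two relevant formulas are sketched below; this also shows why no method hands back TF or IDF as integers - both are floats (Lucene 2.x/3.x-era formulas, reproduced from memory rather than this thread):

```java
public class DefaultSimilaritySketch {

    // tf(t in d) = sqrt(frequency) -- Lucene's DefaultSimilarity.tf()
    public static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf(t) = log(numDocs / (docFreq + 1)) + 1 -- DefaultSimilarity.idf()
    public static float idf(int docFreq, int numDocs) {
        return (float) (Math.log((double) numDocs / (docFreq + 1)) + 1.0);
    }
}
```

Subclassing Similarity and overriding these methods is how you would observe or alter the values; Explanation (via IndexSearcher.explain) shows them per match.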



- Original Message 
 From: vikas kumar vikasn...@gmail.com
 To: general@lucene.apache.org
 Sent: Thu, December 16, 2010 6:53:24 AM
 Subject: TF & IDF values for a search term
 
 Hi All,
 
 
 
 I am working with the latest Lucene API. I want the TF & IDF values for a
 search term explicitly.
 
 Can anybody suggest which class/method can provide me that?
 
 I went through the Similarity, Explanation and Score classes but didn't
 find any method which can return the TF or IDF as integers.
 
 
 
 Regards
 
 Vikas



Re: Should I avoid MultiFieldQueryParser?

2010-05-31 Thread Otis Gospodnetic
What you lose by aggregating all real fields into one field is the ability to 
give fields different scoring weights.
Is a match in the post title as important as a match in the body or in one 
of the comments?
If yes, then aggregate.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
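The trade-off can be made concrete with a toy scoring sketch (the 3/2/1 weights are illustrative assumptions, not recommendations):

```java
// What separate fields buy you: each field's match can carry its own
// boost.  With one catch-all aggregate field, every match scores the
// same regardless of where the term appeared.
public class FieldWeighting {

    // Separate fields: title matches count 3x, content 2x, comments 1x.
    public static float weightedScore(float titleScore, float contentScore,
                                      float commentScore) {
        return 3.0f * titleScore + 2.0f * contentScore + 1.0f * commentScore;
    }

    // Aggregate field: a match is a match, wherever it came from.
    public static float aggregateScore(float titleScore, float contentScore,
                                       float commentScore) {
        return titleScore + contentScore + commentScore;
    }
}
```

A middle ground, if per-field weights do matter, is keeping the separate fields and passing per-field boosts to MultiFieldQueryParser.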



- Original Message 
 From: Bob Eastbrook baconeater...@gmail.com
 To: general@lucene.apache.org
 Sent: Mon, May 17, 2010 12:49:32 AM
 Subject: Should I avoid MultiFieldQueryParser?
 
 Imagine a blog that needs to be searched.  I first thought I'd index
posts and comments using these fields:

BlogPostTitle
BlogPostContent
BlogComment

There could be any number of BlogComments.

I have this working fine and use MultiFieldQueryParser to generate a
query.  It seems to work.  A search for "picnic" matches that term in
post titles, post contents, and comments.

However, Lucene in Action (2nd edition MEAP proof, chapter 5 section
4) seems to advocate against using MultiFieldQueryParser and instead
suggests using a single synthetic field to hold all searchable text.
Perhaps this field would be called "contents" or "keywords".

Is this accepted to be a best practice?  Should I dump a
BlogPostTitle, BlogPostContent, and its BlogComments into a single
field?

Bob


Re: java.io.IOException: read past EOF

2010-03-23 Thread Otis Gospodnetic
Jean-Michel,

java-u...@lucene is a better place to ask.

I'd do this:
* back up your index
* use the CheckIndex tool (if it exists in your version of Lucene)

Maybe the Luke version you are using has a mismatched Lucene version?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
 From: Jean-Michel RAMSEYER jm.ramse...@greenivory.com
 To: general@lucene.apache.org
 Sent: Tue, March 23, 2010 5:36:42 PM
 Subject: java.io.IOException: read past EOF
 
 Hi there,

I'm new to Lucene's world and I'm currently encountering a problem with an
index.  I'm running Lucene 2.4.1 on a Linux server with a Sun JVM version
1.6.0.17b04, in which the issue
http://issues.apache.org/jira/browse/LUCENE-1282 is solved.
I tried to open the indexes on another computer with Luke but it fails too.
The segments* files are empty, so is there a way to rebuild the index from
the cfs files?  Is there a way to recover this index?
Thank you for your answers.

Exception trace:
java.io.IOException: read past EOF
        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
        at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:36)
        at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:68)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:221)
        at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:95)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653)
        at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:115)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:206)
        at org.apache.lucene.search.IndexSearcher.init(IndexSearcher.java:47)

ls -lah result:
total 18G
drwxr-xr-x   2 tomcat tomcat 4.0K 2010-03-22 16:29 .
drwxr-xr-x 121 tomcat tomcat  12K 2010-03-23 14:22 ..
-rw-r--r--   1 tomcat tomcat 1.9G 2010-03-20 13:57 _1gg2.cfs
-rw-r--r--   1 tomcat tomcat 2.0G 2010-03-20 21:45 _1yhj.cfs
-rw-r--r--   1 tomcat tomcat 1.9G 2010-03-21 04:16 _2gdz.cfs
-rw-r--r--   1 tomcat tomcat 2.0G 2010-03-21 15:00 _2y9u.cfs
-rw-r--r--   1 tomcat tomcat 2.0G 2010-03-22 03:21 _3ghg.cfs
-rw-r--r--   1 tomcat tomcat 2.0G 2010-03-22 07:09 _3xty.cfs
-rw-r--r--   1 tomcat tomcat 2.0G 2010-03-22 12:24 _4ekl.cfs
-rw-r--r--   1 tomcat tomcat 192M 2010-03-22 13:25 _4gn2.cfs
-rw-r--r--   1 tomcat tomcat 198M 2010-03-22 14:23 _4ief.cfs
-rw-r--r--   1 tomcat tomcat 195M 2010-03-22 15:14 _4kbm.cfs
-rw-r--r--   1 tomcat tomcat  21M 2010-03-22 15:18 _4kil.cfs
-rw-r--r--   1 tomcat tomcat  23M 2010-03-22 15:22 _4kop.cfs
-rw-r--r--   1 tomcat tomcat  22M 2010-03-22 15:27 _4ku0.cfs
-rw-r--r--   1 tomcat tomcat  25M 2010-03-22 15:31 _4kzb.cfs
-rw-r--r--   1 tomcat tomcat  21M 2010-03-22 15:36 _4l56.cfs
-rw-r--r--   1 tomcat tomcat 1.9M 2010-03-22 15:36 _4l5r.cfs
-rw-r--r--   1 tomcat tomcat 2.0M 2010-03-22 15:37 _4l6c.cfs
-rw-r--r--   1 tomcat tomcat 165K 2010-03-22 15:37 _4l6d.cfs
-rw-r--r--   1 tomcat tomcat  58K 2010-03-22 15:37 _4l6e.cfs
-rw-r--r--   1 tomcat tomcat  80K 2010-03-22 15:37 _4l6f.cfs
-rw-r--r--   1 tomcat tomcat 149K 2010-03-22 15:37 _4l6g.cfs
-rw-r--r--   1 tomcat tomcat 218K 2010-03-22 15:37 _4l6h.cfs
-rw-r--r--   1 tomcat tomcat 198K 2010-03-22 15:37 _4l6i.cfs
-rw-r--r--   1 tomcat tomcat  45K 2010-03-22 15:37 _4l6j.cfs
-rw-r--r--   1 tomcat tomcat  58K 2010-03-22 15:37 _4l6k.cfs
-rw-r--r--   1 tomcat tomcat 158K 2010-03-22 15:37 _4l6l.cfs
-rw-r--r--   1 tomcat tomcat 116K 2010-03-22 15:37 _4l6m.cfs
-rw-r--r--   1 tomcat tomcat 1.1M 2010-03-22 15:37 _4l6n.cfs
-rw-r--r--   1 tomcat tomcat 128K 2010-03-22 15:37 _4l6o.cfs
-rw-r--r--   1 tomcat tomcat 1.9G 2010-03-20 04:12 _hnt.cfs
-rw-r--r--   1 tomcat tomcat    0 2010-03-22 15:37 segments_44o3
-rw-r--r--   1 tomcat tomcat    0 2010-03-22 15:37 segments_44o4
-rw-r--r--   1 tomcat tomcat    0 2010-03-22 15:37 segments.gen
-rw-r--r--   1 tomcat tomcat 1.9G 2010-03-20 07:52 _ywu.cfs


Re: [VOTE] merge lucene/solr development (take 3)

2010-03-14 Thread Otis Gospodnetic
Would it be correct to say that in order for a vote to be perfectly clear, 
the VOTE thread should have just the votes and no comments/discussion?

Otis


- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: general@lucene.apache.org
 Sent: Fri, March 12, 2010 11:02:34 AM
 Subject: Re: [VOTE] merge lucene/solr development (take 3)
 
 
On Mar 12, 2010, at 10:56 AM, Mattmann, Chris A (388J) wrote:

 Hi Simon,
 
 On 3/12/10 4:30 AM, Simon Willnauer simon.willna...@googlemail.com wrote:
 
 I don't think that is the case. A large amount of different concerns
 are out there. Simply based on the amount of huge comments this
 seems to be not a clearly passed vote.
 
 simon
 
 Agreed.


Comments are not votes.  Tally up the +1, 0, and -1's.  There is your vote.
If people don't understand that the thing you are voting on is the first
email in the [VOTE] thread, then I don't know how else to explain it.  This
thread very clearly has something to vote on in the first email.


-Grant 



Re: [VOTE] merge lucene/solr development (take 3)

2010-03-14 Thread Otis Gospodnetic
Hi,

Would it be correct to say that a subset of Lucene/Solr committers discussed 
the proposal internally/offline (i.e. not on MLs) before proposing it?

Thanks,
Otis



Re: [VOTE] merge lucene/solr development (take 3)

2010-03-14 Thread Otis Gospodnetic
Hello,


- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: general@lucene.apache.org
 Sent: Fri, March 12, 2010 12:03:07 PM
 Subject: Re: [VOTE] merge lucene/solr development (take 3)
 
 
On Mar 12, 2010, at 11:54 AM, patrick o'leary wrote:

 Go look at the votes.
 
 Which ones? from vote 1, 2 or 3??

3.  That is this thread.


But I also recall people (Mark Miller maybe?) saying that the votes are not 
being counted and we are just looking to get an idea about the sentiment on 
this suggestion (paraphrasing him, sorry if I messed something up).

Otis


Re: [VOTE] merge lucene/solr development (take 3)

2010-03-14 Thread Otis Gospodnetic
Hi,

But remember the early days of this (or these) vote threads.  I recall some 
people saying things like "I won't vote -1 since I don't want to veto the 
proposal, so I'll vote +/-0".  I recall Doug being one of those people.  I 
don't think we heard back from Doug in subsequent vote threads.  I think there 
were a few others on the fence.

I don't think I even voted because things were not clear and there was too much 
discussion going on.  If I had to vote, I think I'd vote -1 mainly because I 
believe that what I think the proposal's goal is can be achieved with the 
current structure.  I mentioned this in some emails about a week ago, but 
nobody from +1 side reacted from what I recall.

I agree that in general in life it's impossible to get 100% of people to agree 
on something and sometimes that means that a largish minority will have to 
live with a change they disagree with, but here I feel that there are other 
ways of achieving the desired goal, so it's not clear to me why those less 
drastic ways are not tried first.  I'll send a separate email about those ways.

Otis



- Original Message 
 From: Michael McCandless luc...@mikemccandless.com
 To: general@lucene.apache.org
 Sent: Sun, March 14, 2010 6:28:57 AM
 Subject: Re: [VOTE] merge lucene/solr development (take 3)
 
 On Sun, Mar 14, 2010 at 12:26 AM, Michael Busch busch...@gmail.com wrote:
 This whole thing feels like it's been pushed through, and while I'm
 not against the updated proposal anymore (I voted +0), the bad
 feeling that consensus wasn't really reached remains.

But: this vote is not expected nor required to reach consensus.

We as a community are very used to only pursuing things when they
reach [near-]consensus, simply because nearly every biggish topic we
discuss must first reach consensus.  That's a very high bar and it
blocks many good changes (look at how many times we've broached
relaxing the back compat policy...).

This change does not require consensus.  It requires only a majority
to pass, which it has achieved.  Yes, it's contentious, but a change
this big will always be contentious, and this is why Apache requires
only a majority for it to pass.

Mike


Re: [VOTE] merge lucene/solr development (take 3)

2010-03-14 Thread Otis Gospodnetic
Hi,


- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: general@lucene.apache.org
 Sent: Tue, March 9, 2010 5:00:42 PM
 Subject: Re: [VOTE] merge lucene/solr development (take 3)
 
 
On Mar 9, 2010, at 12:38 PM, Otis Gospodnetic wrote:

 * I think Grant may be right.  We don't need this discussion.  Because the
 Solr/Lucene developer overlap is excellent, why not just start moving
 selected Solr code to new Lucene modules, just like Mike proposed we move
 Analysis from Lucene core to a new Lucene module?

Note, if you read what I said again you will realize I wasn't actually
proposing this.  I was saying, actually, that I think it would not be
something that people really wanted, even though it is perfectly legal, just
like poaching is perfectly legal but isn't, in my mind, a good solution.
Sigh.  The problem with email, I guess, especially on long threads.


My feeling was that the majority of people said poaching (in a very positive 
sense) is the way OSS works.
Why can't we start with poaching/refactoring and then, in N months, evaluate 
both the outcome and the process and see if things can work that way in the 
future[*] or something more drastic should be done?

Additionally, if I understand things correctly, poaching is only needed when 
the code is not committed in the right project/location to begin with.

Otis


Re: Less drastic ways

2010-03-14 Thread Otis Gospodnetic
Hello,

- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: general@lucene.apache.org
 Sent: Sun, March 14, 2010 12:40:51 PM
 Subject: Re: Less drastic ways
 
 
On Mar 14, 2010, at 12:28 PM, Otis Gospodnetic wrote:

 Hi,

 Consider this just an email to clarify things for Otis (and maybe a few
 other people).

 Are the following the main goals of the recent merge voting thread(s)?
 * Make it easier for Solr to ride the Lucene trunk
 * Make it easier for people to avoid committing new features to Solr when
 they really belong in some lower-level code - either Lucene core or some
 Lucene module

 Is the only or main change being proposed that lucene-dev and solr-dev move
 to some common-dev (or lucene-dev)?

 If the above is correct, here is what I don't understand:
 * Why can't Solr riding on Lucene trunk be achieved by getting a Lucene
 trunk build into Solr's lib in svn on a daily/hourly basis?

GI: I just don't see that working.


OG: Could you please elaborate?  Also, why not try it and see?  It requires 
very little infrastructure changes and no reorg.  Reorg can always be done 
later if this first step proves to be inadequate.

 * Why can't existing Solr functionality that has been identified as should
 really have been committed to Lucene instead of Solr be moved to Lucene
 over the coming months?

GI: First up is analysis, I suspect.


OG: Si!

 * Why can't Solr developers be required to be subscribed to lucene-dev?

They should.  That's the immediate step going forward until the various
infra gyrations are undertaken.

 * Why can't Solr developers be required/urged to commit any new
 functionality to Lucene if solr-dev and lucene-dev people think that's
 where it belongs? i.e. communicate before committing - the same as measure
 twice, cut once.

GI: Of course they will.  This is how committing works on any and all projects 
anyway.


OG: Hm, again I'm confused.  If this is how it worked in Solr/Lucene land, then 
there wouldn't be pieces in Solr that we now want to refactor and move into 
Lucene core or modules.  A list of about 4-5 such pieces of functionality in 
Solr has already been listed.  That's really my main question.  Why were/can't 
things be committed to the appropriate place?  Why were they committed to Solr?

Thanks,
Otis



Re: [VOTE] merge lucene/solr development (take 3)

2010-03-14 Thread Otis Gospodnetic
Hi,


- Original Message 
 From: Yonik Seeley ysee...@gmail.com
 To: general@lucene.apache.org
 Sent: Sun, March 14, 2010 3:48:10 PM
 Subject: Re: [VOTE] merge lucene/solr development (take 3)
 
 On Sun, Mar 14, 2010 at 2:36 PM, Otis Gospodnetic

 otis_gospodne...@yahoo.com wrote:
  if I understand things correctly, poaching is only needed when the code
  is not committed in the right project/location to begin with.

That is the problem though - Solr should be allowed to keep whatever
code was written under its control, w/o pressure to put it in Lucene

But don't we want DRY?
And don't we want to take some of the goodness that evolved under Solr and 
modularize it, so that vanilla-Lucene users can benefit from individual pieces?

 (and often out of reach).

Does this remain true if we get Lucene trunk jar -> Solr trunk lib going on a 
regular (e.g. nightly) basis?

 And Lucene should be able to poach what it wants from Solr.  But with the
 projects already half overlapping... it was a recipe for conflict.


Poaching - right, it's just that if you build X in project A and then you want 
to move X to project B, it seems like more work needs to be done than if X was 
committed to B to begin with.

 We've already had conflicts about this in the past.  The conflicts were
 either going to get worse over time, esp. with Solr not on Lucene's trunk,
 or we were going to merge.  We've decided to tear down the artificial wall
 and work together.

 Some people suggest that this could have worked w/o merging.  I disagreed,
 as I think the majority of those voting +1 disagreed.

 Not sure who's following lucene-dev and solr-dev, but the committers have
 already been merged.  We're not standing still...


Hm.
So there was talk of Lucene core and the new idea of Lucene modules, which are 
really just standalone libs/APIs/jars, right?
Would it make sense to think of Solr as one such Lucene module?
In other words, don't even bother with merging just the -dev lists, but really 
just merge everything.  In that case Solr's relationship with Lucene core 
becomes much like the relationship Lucene contribs have with Lucene core today 
in terms of compatibility, builds, and committers' responsibilities?

That kind of makes sense to me.  Of course, because of the sheer volume we may 
want to keep -user lists separate and possibly even create new ones for Lucene 
modules that attract enough interest on their own.

Otis



Re: Less drastic ways

2010-03-14 Thread Otis Gospodnetic
I don't get it, Mike. :)
Even if we merge Lucene/Solr and we treat Solr as just another Lucene 
contrib/module, say, contributors who care only about Solr will still patch 
against Solr and Lucene developers or those people who have the itch for that 
functionality being in Lucene, too, will still have to poach/refactor and pull 
that functionality in Lucene later on.  Whether Solr is a separate project or a 
Lucene contrib/module that has its own user (and contributor) community that is 
not tightly integrated with Lucene's -dev community, the same thing will 
happen, no?


Maybe it will help if we made things visual for us visual peeps.  Is this, 
roughly, what the plan is:

trunk/
  lucene-core/
  modules/
    analysis/
    wordnet/
    spellchecker/
    whatever/
    ...
    facets/
    ...
    functions/
  solr/
    dih/
    ...

?

Thanks,
Otis 
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
 From: Michael McCandless luc...@mikemccandless.com
 To: general@lucene.apache.org
 Sent: Sun, March 14, 2010 4:34:42 PM
 Subject: Re: Less drastic ways
 
  Hm, again I'm confused.  If this is how it worked in Solr/Lucene
  land, then there wouldn't be pieces in Solr that we now want to
  refactor and move into Lucene core or modules.  A list of about 4-5
  such pieces of functionality in Solr has already been listed.
  That's really my main question.  Why were/can't things be committed
  to the appropriate place?  Why were they committed to Solr?

Pre-merge:

If someone wants a new functionality in Solr they should be free to
create a patch to make it work well, in Solr, alone.

To expect them to also factor it so that it works well for Lucene-only
users is wrong.  They should not need to, nor be expected to, and they
shouldn't feel bad not having factored it that way.  They use Solr and
they need it working in Solr and that was their itch and they
scratched it, and net/net that was a great step forward for Solr.  We
should not up and reject contributions because they are not well
factored for the two projects.  Beggars can't be choosers...

Someone who later has the itch for this functionality in Lucene should
then be fully free to pick it up, refactor, and make it work in Lucene
alone, by poaching it (pulling it into Lucene).

Poaching is a natural way for code to be pulled across projects... and
while in the short term it'd result in code dup, in the long term this
is how refactoring can happen across projects.  It's completely normal
and fine, in my opinion.

But poaching, while effective, is slow... Lucene would poach, have
to stabilize & do a release, Solr would have to upgrade and then fix
to cut over to Lucene's sources (assuming the sources hadn't
diverged too much, else Solr would have to wait for Lucene's next
release, etc.)

And we have *a lot* of modules to refactor here, between Solr and
Lucene.

So for these two reasons I vote for merging Solr/Lucene dev over gobs
of poaching.  That gives us complete freedom to quickly move the code
around.

Poaching should still be perfectly fine for other cases, like pulling
analyzers from Nutch, from other projects, etc.

Mike


Re: [VOTE] merge lucene/solr development (take 3)

2010-03-09 Thread Otis Gospodnetic
Hello,

(just using Yonik's email to reply, but my comments are more general)


- Original Message 
 From: Yonik Seeley ysee...@gmail.com
 To: general@lucene.apache.org
 Sent: Tue, March 9, 2010 10:04:20 AM
 Subject: Re: [VOTE] merge lucene/solr development (take 3)
 
 On Tue, Mar 9, 2010 at 9:48 AM, Mattmann, Chris A (388J)
 wrote:
  I have built 10s of projects that
  have simply used Lucene as an API and had no need for Solr, and I've built
  10s of projects where Solr made perfect sense. So, I appreciate their
  separation.
 
 As does everyone - which is why there will always be separate
  downloads.  As a user, the only side effect you should see is an
 improved Lucene and Solr.
 
 Saying that Solr should move some stuff to Lucene for Lucene's
  benefit, without regard to whether it's actually beneficial to Solr, is a
 non-starter.  The lucene/solr committers have been down that road
 before.  The solution that most committers agreed would improve the
 development of both projects is to merge development.

* I'd completely understand the non-starter part if Lucene and Solr had 
disjoint sets of committers.  But that's not the case.

* Which is why I (like a few others) don't see why this whole thing cannot be 
solved by better discussion of what to develop where from the get-go

* Whenever people listed features built in Solr that really should have been in 
Lucene, I wondered: why weren't they developed in Lucene in the first 
place?  Again, this should be possible because the same person can commit to 
both projects.

* I hear Grant's explanation on wanting something in Solr ASAP and not wanting 
to commit that something to Lucene (even though it logically belongs there) 
because Solr is not on Lucene trunk, but isn't this just a matter of getting 
Lucene trunk nightly -> Solr trunk lib in svn process going?

* Ian is 100% right.  This stuff clearly requires more discussion and a proper 
VOTE should wait a week or so.

Otis


Re: [VOTE] merge lucene/solr development (take 3)

2010-03-09 Thread Otis Gospodnetic
* Re poaching (aka cross-project refactoring) - I think this is the way to go.  
I think this is normal evolution of OSS projects.  I think this should be done 
if the functionality was not committed to the best (lowest common denominator?) 
project from the beginning, as in all the Solr/Lucene examples brought up

* I think Grant may be right.  We don't need this discussion.  Because the 
Solr/Lucene developer overlap is excellent, why not just start moving selected 
Solr code to new Lucene modules, just like Mike proposed we move Analysis from 
Lucene core to a new Lucene module?

* What do people think about doing what I wrote above as step 1 in this whole 
process?  When that is done in N months, we can see if we can improve on it?  
This would also fit the progress, not perfection mantra.

Otis




- Original Message 
 From: Otis Gospodnetic otis_gospodne...@yahoo.com
 To: general@lucene.apache.org
 Sent: Tue, March 9, 2010 12:23:59 PM
 Subject: Re: [VOTE] merge lucene/solr development (take 3)
 
 Hello,
 
 (just using Yonik's email to reply, but my comments are more general)
 
 
 - Original Message 
  From: Yonik Seeley 
  To: general@lucene.apache.org
  Sent: Tue, March 9, 2010 10:04:20 AM
  Subject: Re: [VOTE] merge lucene/solr development (take 3)
  
  On Tue, Mar 9, 2010 at 9:48 AM, Mattmann, Chris A (388J)
  wrote:
   I have built 10s of projects that
   have simply used Lucene as an API and had no need for Solr, and I've built
   10s of projects where Solr made perfect sense. So, I appreciate their
   separation.
  
  As does everyone - which is why there will always be separate
   downloads.  As a user, the only side effect you should see is an
  improved Lucene and Solr.
  
  Saying that Solr should move some stuff to Lucene for Lucene's
   benefit, without regard to whether it's actually beneficial to Solr, is a
  non-starter.  The lucene/solr committers have been down that road
  before.  The solution that most committers agreed would improve the
  development of both projects is to merge development.
 
 * I'd completely understand the non-starter part if Lucene and Solr had 
 disjoint sets of committers.  But that's not the case.
 
 * Which is why I (like a few others) don't see why this whole thing cannot be 
 solved by better discussion of what to develop where from the get-go
 
  * Whenever people listed features built in Solr that really should have been
  in Lucene, I wondered: why weren't they developed in Lucene in the first
  place?  Again, this should be possible because the same person can commit to
  both projects.
 
 * I hear Grant's explanation on wanting something in Solr ASAP and not 
 wanting 
 to commit that something to Lucene (even though it logically belongs there) 
 because Solr is not on Lucene trunk, but isn't this just a matter of getting 
  Lucene trunk nightly -> Solr trunk lib in svn process going?
 
 * Ian is 100% right.  This stuff clearly requires more discussion and a 
 proper 
 VOTE should wait a week or so.
 
 Otis



Re: [VOTE] merge lucene/solr development

2010-03-04 Thread Otis Gospodnetic
+1

this is software.  let's try it.  if it doesn't work out, we know what to do.

Otis



- Original Message 
 From: Yonik Seeley yo...@apache.org
 To: general@lucene.apache.org
 Sent: Wed, March 3, 2010 5:42:38 PM
 Subject: [VOTE] merge lucene/solr development
 
 Many Lucene/Solr committers think that merging development would be a
 benefit to both projects.
 Separate downloads would remain (among other things), so end users
 would not be impacted (except for higher quality products over time).
 Since this is a change to Lucene/Solr project development, I'd like to
  get a formal vote from the committers of both projects.
 If there are 3 +1s and more +1s than -1s, we can pass this to the
 Lucene PMC to ratify.
 
 -Yonik
 
 Discussion thread:
 http://search.lucidimagination.com/search/document/c7817932400808ad/factor_out_a_standalone_shared_analysis_package_for_nutch_solr_lucene



Re: [VOTE] merge lucene/solr development

2010-03-04 Thread Otis Gospodnetic
- Original Message 

 From: Uwe Schindler u...@thetaphi.de
 To: general@lucene.apache.org
 Sent: Thu, March 4, 2010 11:19:47 AM
 Subject: RE: [VOTE] merge lucene/solr development
 
 If we vote on what Mike says, I revise my vote and simply vote +/-0 to not
 stop progress.  I have some problem with the construct but in general I am
 fine with merging dev lists, splitting into modules, merged committers - but
 not the requirement that tests always pass.  In my opinion, if anything
 changed in Lucene breaks some tests we could open an issue in Solr.

I think that's what's being proposed.  With this proposal, people wearing Solr 
dev hats will know they need to fix Solr sooner than they do now - Hudson 
will tell them on a regular basis, even if you don't spot the failing Solr 
test, or even if you spot it but don't enter it into JIRA, because Hudson 
will tell the Solr guys that something in Lucene trunk changed very recently 
and broke Solr.

Guys, is this interpretation correct?

 One idea: If we really make solr depend on the new lucene lib, solr should
 not have lucene jars in its lib folder, but instead the nightly build should
 fetch the jars from the lucene hudson build.  For committers working in svn,
 maybe some relation to rev numbers (like we do for lucene backwards tests)
 can be put into solr's common-build.xml so the ant script of solr can check
 out the correct lucene rev and build it on the fly.

I was wondering the same thing.  That way svn repos don't need to be 
reorganized.  Or maybe there is some svn repo linking trickery that's possible.

Otis
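
A hypothetical Ant fragment along the lines Uwe sketches - pinning Solr's build 
to a specific Lucene revision in common-build.xml - might look like this (all 
property names, target names, and the revision number are invented for 
illustration, not actual Solr build code):

```xml
<!-- Hypothetical sketch: pin the Lucene revision Solr builds against.
     Property/target names and the revision are invented for illustration. -->
<property name="lucene.svn.url"
          value="https://svn.apache.org/repos/asf/lucene/java/trunk"/>
<property name="lucene.rev" value="920666"/>

<target name="checkout-lucene">
  <!-- Check out the pinned Lucene revision and build its jar on the fly -->
  <exec executable="svn">
    <arg line="checkout -r ${lucene.rev} ${lucene.svn.url} lucene-src"/>
  </exec>
  <ant dir="lucene-src" target="jar-core"/>
</target>
```

A nightly build could instead overwrite `lucene.rev` with the latest revision, 
which would give the "Solr rides Lucene trunk" behavior discussed above.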

  We are voting on this:
  
   * Merging the dev lists into a single list.
  
   * Merging committers.
  
   * When any change is committed (to a module that belongs to Solr or
 to Lucene), all tests must pass
  
   * Release both at once (but the specific logistics is still up for
 discussion)
  
   * Modularlize the sources: pull things out of Lucene's core (break
  out query parser, move all core queries & analyzers under their
 contrib counterparts), pull things out of Solr's core (analyzers,
 queries)
  
  These things would not change:
  
   * Besides modularizing (above), the source code would remain factored
 into separate dirs/modules the way it is now.
  
   * Issue tracking remains separate (SOLR-XXX and LUCENE-XXX issues)
  
   * User's lists remain separate.
  
   * Web sites remain separate.
  
   * Release artifacts/jars remain separate
  
   I am fine with fixing bugs in solr that are there before the change
   but only appear because of the change.
  
  OK
  
   My problem is more such things like the per-segment-mega-problem
    because Solr was simply using Lucene incorrectly (I hope this is not
    said too harshly).
  
  You know, Lucene also used those APIs incorrectly until we cutover
  Lucene's search to be per-segment ;)  We got lucky in that the APIs
  were at best ambiguous about whether the incoming reader was
  per-segment or not.
  
   We did not break backwards.
  
  Right, and so Solr's tests should have passed.
  
   But if we would had to repair the whole solr (which is still not
   finished) after the per-segment change, we were still not having
   per-segment search.
  
  But you won't have to fix Solr from such a change.  Others (people
  wearing Solr hats) will.
  
    And fixing this is surely not easily possible for non-solrcore
    developers like me or you.
  
  Right.
  
   So even if development goes together we should still have the
   possibility to update lucene and if its not a backwards break but
   incorrect usage of Lucene's API (or assumptions on behavior of
   Lucene's API that are not documented or are obvious from API design
   - like for Filters have never to work on Top-Level Searchers and
   *only* use the passed in IndexReader), I would simply break solr and
   let the solr devs fix it in a separate issue.
  
  There would no longer be solr devs -- just devs who sometimes wear
  Solr hats, sometimes wear Lucene hats, sometimes both, at different
  times.
  
  Uwe, this is in fact the proposal -- you can break Solr (but you must
  pass its tests), and devs with Solr hats will fix it.  It's a separate
  issue.
  
  Mike



NYC Search in the Cloud meetup: Jan 20

2010-01-12 Thread Otis Gospodnetic
Hello,

If Search Engine Integration, Deployment and Scaling in the Cloud sounds 
interesting to you, and you are going to be in or near New York next Wednesday 
(Jan 20) evening:

http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/

Sorry for dupes to those of you subscribed to multiple @lucene lists.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



Re: [VOTE] Graduate Lucene.Net as a subproject under Apache Lucene

2009-10-09 Thread Otis Gospodnetic
+1

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: George Aroush geo...@aroush.net
 To: general@lucene.apache.org
 Sent: Thu, October 8, 2009 6:04:09 PM
 Subject: [VOTE] Graduate Lucene.Net as a subproject under Apache Lucene
 
 Hi Folks,
 
 
 
 On behalf of Lucene.Net mentor, committers and community, this is a call for
 vote to graduate the Lucene.Net project
 (http://incubator.apache.org/lucene.net/) as a sub-project under Apache
 Lucene.
 
 
 
 The Lucene.Net mentor, committers, and the community have voted like so:
 
 
 
   +1 from Erik Hatcher (mentor)
 
   +1 from George Aroush (committer)
 
   +1 from Isik YIGIT (aka: DIGY) (committer)
 
   +1 from Doug Sale (committer)
 
   +1 from a total of 70+ Lucene.Net members / followers / users.
 
 
 
 (with no -1 or 0 votes)
 
 
 
 The vote result can be found here:
  http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200909.mbox/%3c166a01ca3739$13947380$3abd5a...@net%3e
 
 
 
 The rationale for graduation is:
 
   * Lucene.Net has been under incubation since April 2006 (3 1/2 years now).
 
   * During incubation, Lucene.Net has:
 
 - Made, 1 official release
 (Incubating-Apache-Lucene.Net-2.0-004-11Mar07).
 
 - Released, as SVN tag, 18 ports of Java Lucene (from 1.9 to 2.4.0).
 
 - Released, as SVN tag, port of WordNet.Net 2.0, SpellChecker.Net 2.0,
 Snowball.Net 2.0, and Highlighter.Net 2.0.
 
 - Released, MSDN style documentation for the above release.
 
 - Accepted, two new committers: Isik YIGIT (DIGY) digydigy @ gmail.com
 and Doug Sale dsale @ myspace-inc.com were added in November 2008 (George
 Aroush george @ aroush.net is the original committer).
 
  - The community has grown, with a healthy following.
 
 - Is being used by well established companies in production (I'm not
 sure what's the legality to mention their names here, or even if I have the
 complete list).
 
 - Is being used by Beagle project.
 
   * Work is already under way to port Java Lucene 2.9 to Lucene.Net 2.9
 
 
 
 If this graduation is approved, Lucene.Net will be officially called Apache
 Lucene.Net
 
 
 
 Please cast your votes:
 
 [ ]  +1  Graduate Lucene.Net as a sub-project under Apache Lucene.
 
 [ ]  -1  Lucene.Net is not ready to graduate as a sub-project under Apache
 Lucene, because ...
 
 
 
 This vote will close on October 14th, 2009.
 
 
 
 Regards,
 
 
 
 -- George Aroush



Re: Index Ratio

2009-06-24 Thread Otis Gospodnetic

Hi Brett,

Try creating a simple MS Word document with just a single character in it.  
Save it as .doc and check the size.  Export to PDF and check the size.  I don't 
know exactly how big those docs will be, but I bet they'll be many, many times 
larger than that one byte character.  Open up your index with Luke to see 
what's in it.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: pof melbournebeerba...@gmail.com
 To: general@lucene.apache.org
 Sent: Wednesday, June 24, 2009 8:47:39 PM
 Subject: Index Ratio
 
 
 Hi, I just completed a batch test index of ~1100 documents of various file
 types and I noticed that the original documents take up about 145MB but my
 index is only 1.7MB?? I remember reading somewhere that the typical
 compression rate is about 20-30% or something, but mine is a little over 1%!
  I'm not complaining or anything. It just struck me as odd, especially as I have
 a lot of archive files and emails with attachments that I parse as well. Has
 anyone else experienced something like this, I'm just curious.
 
 Cheers. Brett.
 -- 
 View this message in context: 
 http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
 Sent from the Lucene - General mailing list archive at Nabble.com.



Re: [ORP] JIRA

2009-06-24 Thread Otis Gospodnetic

And 2 brand new mailing lists you can subscribe to:
openrelevance-user-subscr...@lucene.apache.org 
openrelevance-dev-subscr...@lucene.apache.org 

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: general@lucene.apache.org
 Sent: Wednesday, June 24, 2009 5:27:17 PM
 Subject: [ORP] JIRA
 
 The Open Relevance Project now has JIRA setup: 
 https://issues.apache.org/jira/secure/project/ViewProject.jspa?pid=12310943



Re: Using Lucene to index OSM nodes (400M latitude/longitude points)

2009-06-23 Thread Otis Gospodnetic

Hi Kelly,

I think you want to look at LocalLucene (or LocalSolr).  I haven't played with 
Local*, so I can't provide more than this tip.  Actually, I can also suggest 
dumping Plucene - it's a dead project, and even when it was alive it was quite 
slow.  If you really need to be able to search from a Perl application, your 
best bet may be using a Perl Solr client.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Kelly Jones kelly.terry.jo...@gmail.com
 To: general@lucene.apache.org
 Sent: Tuesday, June 23, 2009 11:52:48 PM
 Subject: Using Lucene to index OSM nodes (400M latitude/longitude points)
 
 Can Lucene index the openstreetmap.org (OSM) node db (400M
 latitude/longitude pairs), and then find the 20 nodes closest to a
 given latitude/longitude?
 
 More specifically:
 
 % Can Lucene index numerical data and understand that 16 is close to
 15, but far away from 16?
 
 % Is Lucene reasonably fast indexing 400M floating point pairs?
 
 % After Lucene creates the 400M index, can it return search results
 reasonably fast?
 
 % Is there a guide/tutorial that shows how to use Lucene to index
 numerical data (I'm using Plucene, but I'll settle for any sort of
 guide)?
 
 I tried to index OSM data w/ SQLite3, but it took forever.
 
 I realize I could use MySQL/PostgreSQL, but I'm looking for an
 embedded/serverless solution.
 
 -- 
 We're just a Bunch Of Regular Guys, a collective group that's trying
 to understand and assimilate technology. We feel that resistance to
 new ideas and technology is unwise and ultimately futile.
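
On Kelly's "16 is close to 15" question: plain lexicographic term ordering does 
not match numeric order, which is why early Lucene setups zero-padded numbers 
into fixed-width strings (later Lucene versions added dedicated numeric field 
support). A small, Lucene-free Java sketch of the padding idea - the class and 
method names here are invented for illustration:

```java
import java.util.Arrays;

public class NumericSort {
    // Zero-pad a non-negative number so lexicographic order matches numeric order.
    static String pad(long n) {
        return String.format("%012d", n);
    }

    public static void main(String[] args) {
        String[] raw = {"16", "160", "15", "9"};
        // Plain string sort: "160" lands between "16" and "9".
        Arrays.sort(raw);
        System.out.println(Arrays.toString(raw)); // [15, 16, 160, 9]

        String[] padded = {pad(16), pad(160), pad(15), pad(9)};
        // Padded terms sort in true numeric order.
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded));
        // [000000000009, 000000000015, 000000000016, 000000000160]
    }
}
```

With padded terms, a numeric range lookup becomes an ordinary term range over 
the index, which is the property range queries over numbers rely on.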



Re: [ORP] Fwd: Confluence email diffs

2009-06-10 Thread Otis Gospodnetic

Excellent, thanks for figuring this out.  That link to diffs either wasn't 
there before or I never noticed it until now.  But that works for me!

Otis



- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: general@lucene.apache.org
 Sent: Tuesday, June 9, 2009 11:06:42 AM
 Subject: [ORP] Fwd: Confluence email diffs
 
 FYI on Confluence email diffs...
 
 Begin forwarded message:
 
  From: Wendy Smoak 
  Date: June 5, 2009 3:38:37 PM EDT
  To: Apache Infrastructure 
  Subject: Re: Confluence email diffs
  
  On Fri, Jun 5, 2009 at 9:18 AM, Grant Ingersoll wrote:
  
  Does anyone know how to have Confluence email the diffs when a page is
  edited instead of the whole page?  Right now it sends in a link to the 
  diffs
  plus the whole page.  I'd rather have it just mail the diffs.
  
  Last time I asked, it wasn't able to send diffs like Moin Moin used to.
  
  This is open...
  http://jira.atlassian.com/browse/CONF-15252 Show unix-style diffs in
  Text notification emails
  
  ... but it does mention Html-based watch notification emails have
  recently been updated to include diffs between the old and new page
  content.
  



Re: [VOTE] Make the Open Relevance Project (ORP) an official Lucene subproject

2009-05-28 Thread Otis Gospodnetic

+1

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: general@lucene.apache.org
 Sent: Thursday, May 28, 2009 7:26:35 AM
  Subject: [VOTE] Make the Open Relevance Project (ORP) an official Lucene 
 subproject
 
 I'd like to call a vote on adding the ORP as an official Lucene subproject 
 per 
 the proposal at http://wiki.apache.org/lucene-java/OpenRelevance with the 
 committers specified on the Wiki page.
 
 [] +1 - Yes, I love it
 [] 0 - I don't care
 [] -1 - I don't love it
 
 Thanks,
 Grant



Re: RAM or File?

2009-05-26 Thread Otis Gospodnetic

Yes.  I remember having a very hard time showing that RAMDirectory is faster 
than FSDirectory back in 2004 while writing Lucene in Action No. 1.  If you run 
the unit test that's supposed to show it, I think you'll see this.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Ted Dunning ted.dunn...@gmail.com
 To: general@lucene.apache.org
 Sent: Tuesday, May 26, 2009 5:01:47 PM
 Subject: RAM or File?
 
 What is the current received wisdom regarding the use of a ram-based or
 file-based retrieval?
 
 Will file-based retrieval match ram based speed after sufficient warmup and
 assuming that the java memory footprint is kept small to allow maximal OS
 caching?
 
 -- 
 Ted Dunning, CTO
 DeepDyve



benchmark contrib, wikipedia, publishing results

2009-05-18 Thread Otis Gospodnetic

Been thinking about ORP on and off all day today... and Mark brought up the 
benchmark contrib.  Shouldn't we publish Lucene results for that somewhere on 
the site?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Open Relevance Project?

2009-05-17 Thread Otis Gospodnetic

Not sure if this was mentioned before, but... hm, I was going to point out 
http://index.isc.org/ (see 
http://ioiblog.wordpress.com/2008/11/07/kicking-off-the-ioi-blog/ ), but the 
server doesn't seem to be listening... aha, here: 
http://ioiblog.wordpress.com/2009/02/

Perhaps we can get data from Dennis and Jeremie?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Ted Dunning ted.dunn...@gmail.com
 To: general@lucene.apache.org
 Sent: Wednesday, May 13, 2009 2:48:43 PM
 Subject: Re: Open Relevance Project?
 
 Crawling a reference dataset requires essentially one-time bandwidth.
 
 Also, it is possible to download, say, wikipedia in a single go.  Likewise
 there are various web-crawls that are available for research purposes (I
 think).  See http://webascorpus.org/ for one example.  These would be single
 downloads.
 
 I don't entirely see the point of redoing the spidering.
 
 On Wed, May 13, 2009 at 10:56 AM, Grant Ingersoll wrote:
 
  Good point, although you never know.  We also will have some bandwidth reqs
  for crawling.
 
 
 
 
 -- 
 Ted Dunning, CTO
 DeepDyve



Re: Allow committers from any subproject to edit TLP site

2009-03-24 Thread Otis Gospodnetic

+1

Otis




- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: general@lucene.apache.org
 Sent: Saturday, March 21, 2009 1:04:05 PM
 Subject: Allow committers from any subproject to edit TLP site
 
 What do people think of allowing any subproject committer the right to edit 
 the 
 TLP site?  I think it would make it easier for people to add news to the TLP 
 and 
 maybe help keep it a little fresher.
 
 -Grant



Re: PyLucene news

2009-01-27 Thread Otis Gospodnetic
And now we are almost running out of space for those Lucene subproject tabs! :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Michael McCandless luc...@mikemccandless.com
 To: general@lucene.apache.org
 Sent: Saturday, January 24, 2009 6:20:45 AM
 Subject: Re: PyLucene news
 
 
 Welcome to Apache, PyLucene!
 
 Mike
 
 Andi Vajda wrote:
 
  
  I'm pleased to announce that the PyLucene subproject now has its web site 
  and 
 mailing lists live:
  
   - http://lucene.apache.org/pylucene/
   - http://lucene.apache.org/pylucene/resources/mailing_lists.html
  
   Please use the new pylucene-...@lucene.apache.org for discussions
   pertaining to this subproject.
  
  Thanks !
  
  Andi..
  
  ps: the JIRA project remains to be setup (INFRA-1861).
  



Re: Welcome PyLucene

2009-01-09 Thread Otis Gospodnetic
Welcome!

Do we need a new PyLucene tab now?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Andi Vajda va...@osafoundation.org
 To: general@lucene.apache.org
 Sent: Thursday, January 8, 2009 10:51:51 PM
 Subject: Re: Welcome PyLucene
 
 
 On Thu, 8 Jan 2009, Grant Ingersoll wrote:
 
  The Lucene PMC is pleased to announce the arrival of PyLucene as a Lucene 
  subproject.  PyLucene is a Python-based port of Lucene by Andi Vajda that was 
 hosted at the Open Source Applications Foundation.  It is automatically 
 generated from the Lucene Java sources.
  
   Initial committers on the project are Andi Vajda and Michael McCandless,
   both of whom are Lucene Java committers.
  
  We are in the process of checking in the code and getting the site setup, 
  so 
 please bear with us as we do.  PyLucene will live in SVN at 
 http://svn.apache.org/repos/asf/lucene/pylucene/  If you wish to help, keep 
 an 
 eye on the Lucene website and this mailing list.  We will follow up with 
 information about PyLucene mailing lists, etc. in those locations.
  
  Welcome PyLucene!
 
 I'm very honored that PyLucene has a new home under the Apache Lucene project.
 
 Many thanks to Grant for making it happen !
 
 Andi..



Re: Synchronization and merging indexes

2008-12-21 Thread Otis Gospodnetic
Logan,

My guess is you'll get more help if you post your question to the Lucene.Net 
mailing list (whose address I don't recall off the top of my head).


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: chaiguy1337 lo...@electricstorm.com
 To: general@lucene.apache.org
 Sent: Saturday, December 20, 2008 7:12:24 PM
 Subject: Synchronization and merging indexes
 
 
 Hi. I'm currently using Lucene.Net as the backing store for a client Windows
 app and it's working great, however I'm now looking at making this an
 occasionally-connected remote-synchronized store.
 
 In other words, I want to use one of the free online storage APIs out there
 that my users can subscribe to and provide login credentials, and use it to
 back up entire copies of the index (we're talking relatively small indexes
 here).
 
 The scenario should allow for multiple clients to be simultaneously
 modifying their local copies of the index, and therefore I will need to
 merge the indexes to allow for multiple sources of change.
 
 My question is first of all if anyone has any experience with this, just for
 some advice, but in particular I'm concerned with the merging process--does
 merging two indexes simply concatenate all documents in each, even if they
 are identical, or is there some kind of logic performed to union duplicates?
 If not, how should I go about doing that manually in an efficient way?
 
 I'm not terribly worried about conflicts or collisions--in the worst case I
 can simply duplicate the document, but I don't want duplicate copies of
 documents created when there is no conflict.
 
 Thanks for any advice.
 
 Logan
 -- 
 View this message in context: 
 http://www.nabble.com/Synchronization-and-merging-indexes-tp21110690p21110690.html
 Sent from the Lucene - General mailing list archive at Nabble.com.
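
For context on the merging question above: Lucene's addIndexes simply concatenates documents and performs no duplicate detection, so any union logic has to key on a unique id field yourself (delete the local document with that id, then add the incoming one). A minimal stdlib sketch of that union-by-id policy — the "id" key and last-writer-wins choice are assumptions for illustration, not part of the Lucene API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexUnion {
    // Merges two collections of documents, each represented as {id, content}.
    // A remote document replaces a local document with the same id, mirroring
    // a manual "delete by id term, then addDocument" dedup pass in Lucene.
    static Map<String, String> unionById(List<String[]> local, List<String[]> remote) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (String[] doc : local) merged.put(doc[0], doc[1]);
        for (String[] doc : remote) merged.put(doc[0], doc[1]); // same id -> overwrite, no duplicate
        return merged;
    }
}
```

With this policy, concurrent edits to the same id collapse to one copy instead of duplicating the document.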



Re: Local Lucene and Local Solr

2008-08-25 Thread Otis Gospodnetic
1. sounds like the right choice to me.  On the topic of committing early, would 
committing it and allowing people to svn up/co, build locally, and implement 
the missing pieces not get us faster to the point of being able to release it?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Grant Ingersoll [EMAIL PROTECTED]
 To: general@lucene.apache.org
 Sent: Monday, August 25, 2008 11:41:10 AM
 Subject: Local Lucene and Local Solr
 
 The creators of Local Lucene and Local Solr 
 (http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene.htm 
 ) have generously agreed to donate the code to Lucene.
 
 The Lucene PMC is working through the details of the software grant.  
 The one remaining road block, potentially, is that there is still some  
 LGPL code involved that needs to be replaced.   We could commit this  
 before removing it, as long as we don't release it.  So, if there are  
 volunteers willing to do the work, I'd be more inclined to move  
 forward w/ finishing out the grant and committing it.
 
 In the meantime, I would like to open the discussion of where this  
 should live in Lucene.
 
 The options are:
 
 1. Split them up and make them each a part of Lucene and Solr and let  
 the committers of those projects decide where things go
 2. Create a separate Geo search subproject under Lucene TLP with it's  
 own set of committers, etc. just like any of the other sub projects  
 (Solr, Tika, Java, etc.)  This requires the PMC to vote to create a  
 new subproject.
 3. Other?
 
 So, what do people think?  Where would you like to see Local Search  
 live w.r.t. Lucene and Solr?
 
 
 -Grant



Re: Lucene is not able to index certain words of txt file converted form pdf

2008-06-18 Thread Otis Gospodnetic
Hi,

Use java-user list, there are more people on it.

You need to change the setting in IndexWriter that tells Lucene how many tokens 
from a document to index.  By default it indexes only 10,000.  I can't 
remember the parameter name, but look at the IndexWriter javadocs, it's right 
there.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
 From: m657m [EMAIL PROTECTED]
 To: general@lucene.apache.org
 Sent: Wednesday, June 18, 2008 8:24:53 AM
 Subject: Lucene is not able to index certain words of txt file converted form 
 pdf
 
 
 Hi
 
 I am using Lucene for indexing and searching the documents.
 I have a PDF (Lucene_in_action.pdf) file which I converted to a txt file
 using PDFBox.
 I indexed the same txt file, but while searching it is not able to find
 certain words. Lucene does give me results if I search for other
 words.
 I am not able to find any reason for that.
 If any of you intellectuals can help me out in finding the reason.
 
 Thanks in advance. 
 -- 
 View this message in context: 
 http://www.nabble.com/Lucene-is-not-able-to-index-certain-words-of-txt-file-converted-form-pdf-tp17981585p17981585.html
 Sent from the Lucene - General mailing list archive at Nabble.com.
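
The setting Otis refers to is IndexWriter's maxFieldLength (10,000 tokens by default in that era): tokens past the limit are silently dropped, so words late in a long PDF never reach the index. A stdlib sketch of that truncation effect (the limit of 5 below is artificial, just to make the behavior visible):

```java
import java.util.Arrays;
import java.util.List;

public class FieldTruncation {
    // Keeps only the first maxFieldLength whitespace-separated tokens,
    // the way IndexWriter's default field-length cap did.
    static List<String> indexedTokens(String text, int maxFieldLength) {
        String[] tokens = text.trim().split("\\s+");
        return Arrays.asList(tokens).subList(0, Math.min(tokens.length, maxFieldLength));
    }
}
```

A search for a token beyond the cap finds nothing, which is exactly the "certain words are not found" symptom described above.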



Re: Wildcard Search over multiple fields

2008-05-07 Thread Otis Gospodnetic
Hello,

Wildcard queries are inefficient in general.  But it sounds like you simply want 
to combine them into a BooleanQuery where each clause is a SHOULD clause.

A better place to ask is java-user list.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
 From: jm85 [EMAIL PROTECTED]
 To: general@lucene.apache.org
 Sent: Wednesday, May 7, 2008 7:14:03 AM
 Subject: Wildcard Search over multiple fields
 
 
 Hello,
 
 What is the best method of performing a leading and trailing wildcard search
 over multiple fields? Currently I am performing a wildcard search on one field
 at a time, but this is potentially inefficient:
 
 WildcardQuery wildCardQuery = new WildcardQuery(new Term(searchField, "*" +
 searchText + "*"));
 
 Thanks for your help,
 
 James Murphy
 -- 
 View this message in context: 
 http://www.nabble.com/Wildcard-Search-over-multiple-fields-tp17101839p17101839.html
 Sent from the Lucene - General mailing list archive at Nabble.com.
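
In Lucene terms, Otis's suggestion is one WildcardQuery per field wrapped in a BooleanQuery, each clause added as SHOULD. Since a leading-and-trailing wildcard is effectively a substring test, the combined logic can be sketched with stdlib code (the document shape and field names here are made up for illustration; this is not the Lucene API itself):

```java
import java.util.List;
import java.util.Map;

public class MultiFieldWildcard {
    // OR-combines a *text* (substring) match across several fields,
    // like a BooleanQuery holding one SHOULD-clause WildcardQuery per field.
    static boolean matchesAnyField(Map<String, String> doc, List<String> fields, String text) {
        return fields.stream().anyMatch(f -> doc.getOrDefault(f, "").contains(text));
    }
}
```

One query object with OR semantics replaces the loop over per-field searches.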



Re: Improving indexing and some questions

2008-03-25 Thread Otis Gospodnetic
Marko,

You are not getting any responses here because this general@ list is pretty 
empty.
Please email java-user list.  I mentioned this in my previous reply, but for 
some reason you didn't go for it.

Please see http://wiki.apache.org/lucene-java/HowToContribute

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Marko Novakovic [EMAIL PROTECTED]
To: general@lucene.apache.org
Sent: Monday, March 24, 2008 8:17:40 PM
Subject: Improving indexing and some questions

Dear all,

I have ideas for improving indexing for web search. I
have written a tutorial for the IPSI conference in
Opatija about ranking in search engines: The New
Avenues in Web Search. My article will be published
in IPSI Magazine by October 2008.
This tutorial and my ideas were inspired by articles
from IEEE Computer Magazine, August 2007 issue.
I wrote about individual, collaborative, sponsored and
mobile search, and social aspects of the Web.
The main idea is to implement indexing based on a
relational database. This database would keep records
about users, physical and logical
communities (like an enterprise, country, autonomous
system, provider, etc.), queries, and users' clicks.
A service which tracks and analyzes user behaviour
would also be involved.
Indexing would be dynamically updated by users' recent
behavior (clicks for the same or a similar query). Ranking
would be implemented by a support vector machine, which
would give a relevance score for each query for each user.
This algorithm is described in the article:
T. Joachims, F. Radlinski: Search Engines that
Learn from Implicit Feedback, IEEE Computer,
August 2007, pp 38
Community indexing would be implemented by making
relevant promotions, which is described in the article:
B. Smyth: A Community-Based Approach to Personalizing
Web Search, IEEE Computer, August 2007, pp 45-46
I also consider some concepts which could be
implemented for indexing in sponsored and mobile
search and the social web.

I would be honoured to get feedback from Apache's
staff.

Best regards

__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 





Re: how to control the disk size of the indices

2008-03-24 Thread Otis Gospodnetic
Hi Yannis,

I don't think there is anything of that sort in Lucene, but this shouldn't be 
hard to do with a process outside Lucene.  Of course, optimizing an index 
increases its size temporarily, so your external process would have to take 
that into account and play it safe.  You could also set mergeFactor to 1, which 
should keep your index in a fully optimized state if you don't do any deletions 
and near-optimized state if you do deletions.

You should discuss this on java-user list, though, so I'm CCing that list where 
you can continue the discussion.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Yannis Pavlidis [EMAIL PROTECTED]
To: general@lucene.apache.org
Sent: Monday, March 24, 2008 7:33:26 PM
Subject: how to control the disk size of the indices


Hi all,

I wanted to ask the list whether there is an easy and efficient way to manage 
the size (in bytes) of a lucene index stored on disk.

Basically I would like to limit Lucene to storing only 100 GB of information. When 
Lucene reaches that limit I would delete documents (using an LRU 
algorithm based on timestamps), but in no case should the disk space occupied by 
Lucene exceed 100 GB.

I experimented with Lucene 2.3.1 and the only way I could accomplish that was by 
calling the optimize method (after the index size exceeded the max size) on the 
IndexWriter. I was looking for a more performant way, perhaps controlling 
when Lucene merges segments so as to not exceed the pre-set limit.

Any ideas or suggestions would be highly appreciated.

Thanks in advance,

Yannis.
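
The external process Otis suggests boils down to: measure the index, and while it exceeds the cap, delete the oldest documents first. A stdlib sketch of just that bookkeeping, using (timestamp, size) pairs in place of real Lucene documents — the names and the byte cap are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SizeCappedIndex {
    // Each doc is {timestamp, sizeInBytes}; evicts the oldest docs (LRU by
    // timestamp) until the total size fits under capBytes.
    static List<long[]> enforceCap(List<long[]> docs, long capBytes) {
        List<long[]> kept = new ArrayList<>(docs);
        kept.sort(Comparator.comparingLong(d -> d[0]));  // oldest first
        long total = kept.stream().mapToLong(d -> d[1]).sum();
        while (total > capBytes && !kept.isEmpty()) {
            total -= kept.remove(0)[1];                  // drop the LRU document
        }
        return kept;
    }
}
```

As Otis notes, a real implementation would also leave headroom for the temporary size growth that optimize causes.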





Re: Google Summer of Code

2008-03-19 Thread Otis Gospodnetic
Bok Marko,

Very interested.  I suggest you continue the discussion on [EMAIL PROTECTED], 
though (CC-ing)

You should note that there are several efforts around distributed Lucene.  
There is SOLR-303 for distributed search, and there is some work in progress in 
Hadoop land around distributed indexing in a Hadoop cluster.

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Marko Novakovic [EMAIL PROTECTED]
To: general@lucene.apache.org
Sent: Wednesday, March 19, 2008 8:02:29 PM
Subject: Google Summer of Code

Dear,

I have an idea to implement a distributed version of Lucene
for Google Summer of Code. A distributed version would
improve the speed of ranking. I also have an idea to
implement ranking criteria based on users' behavior
and on communities.
If you are interested in this I will describe all the
details of my idea.

Greetings.
Marko Novakovic


  

Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ





Re: Lucene indexes in memory

2007-02-14 Thread Otis Gospodnetic
Deepa,

You probably want to ask on [EMAIL PROTECTED] list.
Lucene reads in the whole .tii index file (see the Lucene file formats documentation 
for explanations of the various Lucene index files).
It doesn't read in *all* the index files, as those could be quite big.
You *can* read in your index in a RAMDirectory via FSDirectory, though, and it 
sounds like that is what you are after.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Deepa Paranjpe [EMAIL PROTECTED]
To: general@lucene.apache.org
Sent: Wednesday, February 14, 2007 2:32:53 PM
Subject: Lucene indexes in memory

Hi all,

I want to understand how Lucene searches its index -- does it load the whole
index into memory at once? Is there any way to make sure that it does?

I want to maximally optimize the search time required by Lucene over
~7M short documents. The queries I deal with are 6 to 7 tokens on
average.

Your help on this will be appreciated.

-Deepa







Re: [Fwd: [PROPOSAL] index server project]

2006-10-20 Thread Otis Gospodnetic
That's distributed indexing, built on top of Sun Grid.  The project won a $50K 
prize.


- Original Message 
From: Alexandru Popescu [EMAIL PROTECTED]
To: general@lucene.apache.org
Sent: Thursday, October 19, 2006 10:19:00 AM
Subject: Re: [Fwd: [PROPOSAL] index server project]

I am not sure this is (somehow) related, but I think I have noticed
some project on a Sun contest (it was the big prize winner). I cannot
retrieve it now, but hopefully somebody else will.

./alex
--
.w( the_mindstorm )p.


On 10/19/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
 Hi Doug,

 we discussed the need of such a tool several times internally and
 developed some workarounds for nutch, so I would be definitely
 interested to contribute to such a project.
 Having a separated project that depends on hadoop would be the best
 case for our usecases.

 Best,
 Stefan



 Am 18.10.2006 um 23:35 schrieb Doug Cutting:

  FYI, I just pitched a new project you might be interested in on
  [EMAIL PROTECTED]  Dunno if you subscribe to that list, so I'm
  spamming you.  If it sounds interesting, please reply there.  My
  management at Y! is interested in this, so I'm 'in'.
 
  Doug
 
   Original Message 
  Subject: [PROPOSAL] index server project
  Date: Wed, 18 Oct 2006 14:17:30 -0700
  From: Doug Cutting [EMAIL PROTECTED]
  Reply-To: general@lucene.apache.org
  To: general@lucene.apache.org
 
  It seems that Nutch and Solr would benefit from a shared index serving
  infrastructure.  Other Lucene-based projects might also benefit from
  this.  So perhaps we should start a new project to build such a thing.
  This could start either in java/contrib, or as a separate sub-project,
  depending on interest.
 
  Here are some quick ideas about how this might work.
 
  An RPC mechanism would be used to communicate between nodes (probably
  Hadoop's).  The system would be configured with a single master node
  that keeps track of where indexes are located, and a number of slave
  nodes that would maintain, search and replicate indexes.  Clients
  would
  talk to the master to find out which indexes to search or update, then
  they'll talk directly to slaves to perform searches and updates.
 
  Following is an outline of how this might look.
 
  We assume that, within an index, a file with a given name is written
  only once.  Index versions are sets of files, and a new version of an
  index is likely to share most files with the prior version.  Versions
  are numbered.  An index server should keep old versions of each index
  for a while, not immediately removing old files.
 
  public class IndexVersion {
String Id;   // unique name of the index
int version; // the version of the index
  }
 
  public class IndexLocation {
IndexVersion indexVersion;
InetSocketAddress location;
  }
 
  public interface ClientToMasterProtocol {
IndexLocation[] getSearchableIndexes();
IndexLocation getUpdateableIndex(String id);
  }
 
  public interface ClientToSlaveProtocol {
// normal update
void addDocument(String index, Document doc);
int[] removeDocuments(String index, Term term);
void commitVersion(String index);
 
// batch update
void addIndex(String index, IndexLocation indexToAdd);
 
// search
SearchResults search(IndexVersion i, Query query, Sort sort, int n);
  }
 
  public interface SlaveToMasterProtocol {
// sends currently searchable indexes
 // receives updated indexes that we should replicate/update
public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
  }
 
  public interface SlaveToSlaveProtocol {
String[] getFileSet(IndexVersion indexVersion);
byte[] getFileContent(IndexVersion indexVersion, String file);
// based on experience in Hadoop, we probably wouldn't really use
// RPC to send file content, but rather HTTP.
  }
 
  The master thus maintains the set of indexes that are available for
  search, keeps track of which slave should handle changes to an
  index and
  initiates index synchronization between slaves.  The master can be
  configured to replicate indexes a specified number of times.
 
  The client library can cache the current set of searchable indexes and
  periodically refresh it.  Searches are broadcast to one index with
  each
  id and return merged results.  The client will load-balance both
  searches and updates.
 
  Deletions could be broadcast to all slaves.  That would probably be
  fast
  enough.  Alternately, indexes could be partitioned by a hash of each
  document's unique id, permitting deletions to be routed to the
  appropriate slave.
 
  Does this make sense?  Does it sound like it would be useful to Solr?
  To Nutch?  To others?  Who would be interested and able to work on it?
 
  Doug
 

 ~~~
 101tec Inc.
 search tech for web 2.1
 Menlo Park, California
 http://www.101tec.com










Re: [Fwd: [PROPOSAL] index server project]

2006-10-20 Thread Otis Gospodnetic
Damn Y! mail shortcut.
The link to the project is in my Lucene group:  http://www.simpy.com/group/363

Otis










Re: CLucene incubation - call for a mentor

2006-10-20 Thread Otis Gospodnetic
Hi Ben,

I can't volunteer, but you may want to check with Garrett Rooney.  He stopped 
work on lucene4c, so he may be interested in helping you with moving CLucene 
under Apache Lucene.

Otis

- Original Message 
From: Ben van Klinken [EMAIL PROTECTED]
To: general@lucene.apache.org
Sent: Saturday, October 14, 2006 3:20:10 AM
Subject: CLucene incubation - call for a mentor

Hi,

I am one of the developers of CLucene, a C++ port of Lucene.

A long while back, CLucene was invited to join the ASF incubation
program under Lucene. For various reasons this hasn't happened yet. But
CLucene has still been happily progressing and interest in the project
continues to increase - many open source projects (such as ht://dig
and strigi) as well as many companies use CLucene.

CLucene would of course do much better if we were part of the big
happy family of Lucene and its sub-projects. However, I believe our
main obstacle to this is the absence of an ASF mentor.

So basically I'm asking this: would Apache Lucene still like to have
us? If yes, would anyone be interested, or know of someone interested
in being our mentor?

Look forward to a response,

Ben





Re: What are the pros and cons of using the C# version of Lucene as compared to the Java version in a .NET environment?

2006-05-09 Thread Otis Gospodnetic
I have never used Lucene under Windows, but I do know that some quite high 
profile Internet companies have used the Lucene.Net port and are happy with it.  
See http://xanga.com

Otis

- Original Message 
From: George Carrette [EMAIL PROTECTED]
To: general@lucene.apache.org
Sent: Tuesday, May 9, 2006 1:00:08 PM
Subject: What are the pros and cons of using the C# version of Lucene as 
compared to the Java version in a .NET environment?

If you are developing mainly in the Microsoft.NET framework 2.0 then
it seems that you have 3 choices for running Lucene.

What are the pros and cons of each choice?

1. Use the C# code from the apache incubator project 
   Lucene.Net http://incubator.apache.org/projects/lucene.net.html 
2. Use the flagship project Lucene Java 
http://lucene.apache.org/java/docs/index.html
   With the Sun java runtime and define some web services in an application such
   As Apache Tomcat that you can call from your other .NET framework code.
3. Use the Lucene Java sources as above, but compile it using the J# compiler, 
such
   As illustrated here: http://alum.mit.edu/www/gjc/lucene-java-vjc.html

I am particularly interested in risks associated with choice #3. Is the 
Microsoft J# compiler to be trusted? Do the people using the gcj compiler have 
any experience to guide somebody considering the use of a Java compiler not 
provided by Sun?

This is for a mission-critical application at a high profile internet media 
company.








Re: What are the pros and cons of using the C# version of Lucene as compared to the Java version in a .NET environment?

2006-05-09 Thread Otis Gospodnetic
I think all of these questions are awaiting a brave soul with an itch.  Got an itch?

Otis


- Original Message 
From: Raghavendra Prabhu [EMAIL PROTECTED]
To: general@lucene.apache.org
Sent: Tuesday, May 9, 2006 4:21:54 PM
Subject: Re: What are the pros and cons of using the C# version of Lucene as 
compared to the Java version in a .NET environment?

So will the .NET port, in terms of functionality, offer capabilities equivalent to
those of the Java version?

Are there any crucial features which are missing and have not yet been
implemented in Lucene.net (considering the fact that lucene java is gonna
hit 2.0 soon and lucene.net is still in dev phase in 1.9)

Is there any advantage in terms of speed (when you look at the .NET port)?

Are there any benchmark comparisons available

Rgds
Prabhu

On 5/10/06, Monsur Hossain [EMAIL PROTECTED] wrote:


 As Otis mentioned we are using Lucene.NET (the recent 1.9 build) and we
 are
 quite happy with it.  There were some memory leak bugs early on since .NET
  doesn't have an equivalent of Java's HashMap; but after those were worked out,
  performance and scalability have been great.  The issues we are focused on
 now
 are general Lucene issues (such as scaling a large index and reducing
 indexing times) rather than C# vs. Java issues.  I highly recommend it.

 Monsur
 Xanga.com



  -Original Message-
  From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, May 09, 2006 2:02 PM
  To: general@lucene.apache.org
  Subject: Re: What are the pros and cons of using the C#
  version of Lucene as compared to the Java version in a .NET
  environment?
 
  I have never used Lucene under Windows, but I do know that
  some quite high profile Internet companies have used
  Lucene.net port and are happy with it.  See http://xanga.com
 
  Otis
 
  - Original Message 
  From: George Carrette [EMAIL PROTECTED]
  To: general@lucene.apache.org
  Sent: Tuesday, May 9, 2006 1:00:08 PM
  Subject: What are the pros and cons of using the C# version
  of Lucene as compared to the Java version in a .NET environment?
 
  If you are developing mainly in the Microsoft.NET framework 2.0 then
  it seems that you have 3 choices for running Lucene.
 
  What are the pros and cons of each choice?
 
  1. Use the C# code from the apache incubator project
 Lucene.Net http://incubator.apache.org/projects/lucene.net.html
  2. Use the flagship project Lucene Java
  http://lucene.apache.org/java/docs/index.html
 With the Sun java runtime and define some web services in
  an application such
 As Apache Tomcat that you can call from your other .NET
  framework code.
  3. Use the Lucene Java sources as above, but compile it using
  the J# compiler, such
 As illustrated here:
  http://alum.mit.edu/www/gjc/lucene-java-vjc.html
 
  I am particularly interested in risks associated with choice
  #3. Is the Microsoft J# compiler to be trusted? Do the people
  using the gcj compiler have any experience to guide somebody
  considering the use of a Java compiler not provided by Sun?
 
  This is for a mission-critical application at a high profile
  internet media company.
 
 
 
 
 
 
 








RE: Binary fields in index

2005-09-26 Thread Otis Gospodnetic
One of the Jakarta Commons ones - jakarta.apache.org/commons/codec/ 

Otis

--- Tricia Williams [EMAIL PROTECTED] wrote:

 Which library can Base64 be found in?
 
 Thanks,
 Tricia
 
 On Mon, 26 Sep 2005, Koji Sekiguchi wrote:
 
  You can encode (e.g. base64) the binary data to get a String
  and store the String.
 
  Koji
 
   -Original Message-
   From: Fredrik Andersson [mailto:[EMAIL PROTECTED]
   Sent: Monday, September 26, 2005 6:31 PM
   To: general@lucene.apache.org
   Subject: Binary fields in index
  
  
   Hello Gang!
  
   Is there any trick, or undocumented way, to store binary
 (unindexed,
   untokenized) data in a Lucene Field? All the Field
   constructors just deal
   with Strings. I'm currently using another database to store
   binary data, but
   it would be very neat, and more efficient, to store it
   directly in Lucene.
  
   Thanks in advance,
   Fredrik
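
Koji's base64 suggestion can be done with commons-codec as Otis points out; on later JDKs (8+), java.util.Base64 makes the same round trip stdlib-only. A sketch of encoding bytes to a String suitable for a stored, unindexed Field, and decoding on retrieval:

```java
import java.util.Base64;

public class BinaryFieldCodec {
    // Encode binary data to a String that can be stored in a Lucene Field.
    static String encode(byte[] binary) {
        return Base64.getEncoder().encodeToString(binary);
    }

    // Decode the stored String back to the original bytes after retrieval.
    static byte[] decode(String stored) {
        return Base64.getDecoder().decode(stored);
    }
}
```

The cost is roughly a 4/3 size blow-up of the stored data, which is the usual trade-off for smuggling binary through a String-only API.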
  
 
 
 



Re: How to install lucene on windows ?

2005-09-11 Thread Otis Gospodnetic
All you really need is:
http://apache.oc1.mirrors.redwire.net/jakarta/lucene/binaries/lucene-1.4.3.jar

Otis

--- Arpit Sharma [EMAIL PROTECTED] wrote:

 I have installed tomcat on XP but when I go to page
 http://apache.oc1.mirrors.redwire.net/jakarta/lucene/binaries/
 
 
 it shows lot's of files. Do I need to download all of
 them ? after downloading them what should I do ?
 
 Thanks
 
 



Re: Problem of indexing pdf files

2005-09-11 Thread Otis Gospodnetic
That's a log4j warning message, because one of the PDFBox classes is
trying to log something, and you don't have log4j configured
appropriately.  This is not a Lucene issue, and it's a warning, so you
can ignore it if you want.

Otis


--- tirupathi reddy [EMAIL PROTECTED] wrote:

 Hello,
  
 I am getting the following warning message when I am indexing the
 pdf files using Lucene Indexing.
  
  log4j:WARN No appenders could be found for logger
 (org.pdfbox.pdfparser.PDFParser).
  log4j:WARN Please initialize the log4j system properly.
  
 This is the code I am using:
  
  if (pdf.exists())
  {
   String text = "";
   try { 
    PDDocument document = PDDocument.load(pdf);   // load the PDF file
    PDFTextStripper pts = new PDFTextStripper();  // extract the text
    text = pts.getText(document);  
    document.close();
   } 
   catch (IOException e) { 
    System.out.println("File not found"); 
   }
   mDocument.add(Field.Text("fulltext", text));
  }
  
  
 thanx,
  MTREDDY
  
  
 
 
 Tirupati Reddy Manyam 
 24-06-08, 
 Sundugaullee-24, 
 79110 Freiburg 
 GERMANY. 
 
 Phone: 00497618811257 
 cell : 004917624649007
 



Re: IndexWriter and IndexReader open at the same time

2005-08-08 Thread Otis Gospodnetic
If you have the Lucene book, look at Chapter 2 (page 59 under section
2.9 (Concurrency, thread-safety, and locking issues) in chapter 2
(Indexing)):

  http://www.lucenebook.com/search?query=concurrency+rules

Also, look at Lucene's Bugzilla, where you'll find a contribution that
helps with concurrent IndexReader/IndexWriter usage.

Otis


--- Greg Love [EMAIL PROTECTED] wrote:

 Hello,
 
  I have an application that gets many delete and write requests at the
  same time.  To avoid opening and closing the IndexWriter and
  IndexReader every time one of them needs to do a write operation, I keep
  them both open and have a shared lock around them whenever I need to
  use them for writing.  Everything seems to be working in order, but
  I'm not sure if this is a safe thing to do.  Please let me know.
 
 thank you, 
 lavafish
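
One common refinement of the shared-lock pattern described above is a ReadWriteLock: many searches can proceed concurrently while writes get exclusive access. A stdlib sketch of the guard — the counter here merely stands in for the real IndexWriter/IndexReader operations:

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GuardedIndex {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private int docCount = 0;  // stands in for the shared IndexWriter/IndexReader state

    void write() {             // add/delete path: exclusive access
        lock.writeLock().lock();
        try { docCount++; }
        finally { lock.writeLock().unlock(); }
    }

    int read() {               // search path: shared access
        lock.readLock().lock();
        try { return docCount; }
        finally { lock.readLock().unlock(); }
    }
}
```

This keeps both objects open, as in the original question, but documents the invariant (exclusive writes, shared reads) instead of relying on a single coarse lock.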



Re: indexing FTP or HTTP or Database

2005-07-20 Thread Otis Gospodnetic
For indexing FTP and HTTP servers, see Nutch (sub-project of Lucene).

For indexing a DB you can write some custom JDBC to pull your data from
DB and index it with Lucene.  I imagine a few other people will email
suggestions ;)

Otis


--- Bassem Elsayed [EMAIL PROTECTED] wrote:

 How can I use Lucene to index and search FTP, HTTP, or a database?
 
 Thank you.
 
  
 
 Thanks and Best Regards,
 
 
 
 Bassem Elsayed Saad
 
 Software Engineer
 
 ICT Department
 
 Bibliotheca Alexandrina
 
 P.O. Box 138, Chatby
 
 Alexandria 21526, Egypt
 
 Tel: +(203) 483 , Ext: 1496
 
 Mob: +(2010) 627 2875
 
 Fax: +(203) 482 0405
 
 Email: [EMAIL PROTECTED]
 
 Web Site: www.bibalex.org http://www.bibalex.org/