Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-07 Thread Glen Newton
Thank-you. Glen On Sat, 6 Aug 2022 at 23:46, Tomoko Uchida wrote: > Hi Glen, > I verified your Jira/GitHub usernames and added a mapping. > > https://github.com/apache/lucene-jira-archive/commit/ae78d583b40f5bafa1f8ee09854294732dbf530b > > Tomoko > > > 20

Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Glen Newton
jira: gnewton github: gnewton (github.com/gnewton) Thanks, Glen On Sat, 6 Aug 2022 at 14:11, Tomoko Uchida wrote: > Hi everyone. > > I wanted to let you know that we'll extend the deadline until the date the > migration is started (the date is not fixed yet). > Please let us know your Ji

Re: Lucene 6.3 faceting documentation

2016-11-10 Thread Glen Newton
-date though. > > Shai > > On Thu, Nov 10, 2016 at 4:40 PM Glen Newton wrote: > > > I am looking for documentation on Lucene faceting. The most recent > > documentation I can find is for 4.0.0 here: > > > > http://lucene.apache.org/core/4_0_0/facet/org

Lucene 6.3 faceting documentation

2016-11-10 Thread Glen Newton
I am looking for documentation on Lucene faceting. The most recent documentation I can find is for 4.0.0 here: http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html Is there more recent documentation for 6.3.0? Or 6.x? Thanks, Glen

Re: docid is just a signed int32

2016-08-19 Thread Glen Newton
> load a single document (or a fixed number of them) for every step. In > the case you call loadAll() there is a problem with memory. > > > > > 2016-08-19 15:39 GMT+02:00, Glen Newton : > > Making docid an int64 is a non-trivial undertaking, and this work needs > to > &

Re: docid is just a signed int32

2016-08-19 Thread Glen Newton
Making docid an int64 is a non-trivial undertaking, and this work needs to be compared against the use cases and how compelling they are. That said, in the lifetime of most software projects a decision is made to break backward compatibility to move the project forward. When/if moving to int64 hap

Re: docid is just a signed int32

2016-08-18 Thread Glen Newton
Or maybe it is time Lucene re-examined this limit. There are use cases out there where >2^31 does make sense in a single index (huge number of tiny docs). Also, I think the underlying hardware and the JDK have advanced to make this more defendable. Constructively, Glen On Thu, Aug 18, 2016 at

Re: Question about JoinUtil

2014-12-17 Thread Glen Newton
Query would look like if it allowed a 'toQuery' > capability and returned data from both sides of the join. > > 3. If you can denormalize your data into hierarchies, then you could > use index-time joining (BlockJoin) for better performance and easier > collecting of your gro

Re: Question about JoinUtil

2014-12-16 Thread Glen Newton
Anyone? On Thu, Dec 11, 2014 at 2:53 PM, Glen Newton wrote: > Is there any reason JoinUtil (below) does not have a 'Query toQuery' > available? I was wanting to filter on the 'to' side as well. I feel I > am missing something here. > > To make sure this is not

Question about JoinUtil

2014-12-11 Thread Glen Newton
Is there any reason JoinUtil (below) does not have a 'Query toQuery' available? I was wanting to filter on the 'to' side as well. I feel I am missing something here. To make sure this is not an XY problem, here is my use case: I have a many-to-many relationship. The left, join, and right 'table'

Re: [ANN] word2vec for Lucene

2014-11-20 Thread Glen Newton
Hi Koji, Semantic vectors is here: http://code.google.com/p/semanticvectors/ It is a project that has been around for a number of years and used by many people (including me http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html ). If you could compare and contrast word2vec

Re: IndexWriter croaks on large file

2014-02-14 Thread Glen Newton
You should consider making each _line_ of the log file a (Lucene) document (assuming it is a log-per-line log file) -Glen On Fri, Feb 14, 2014 at 4:12 PM, John Cecere wrote: > I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At > any rate, I don't have control over the siz

Re: Index-time term expansion

2013-05-03 Thread Glen Newton
Thanks :-) On Fri, May 3, 2013 at 2:31 PM, Alan Woodward wrote: > Hi Glen, > > You want the SynonymFilter: > http://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymFilter.html > > Alan Woodward > www.flax.co.uk > > > On 3 M

Index-time term expansion

2013-05-03 Thread Glen Newton
Hello, I know I've seen it go by on this list and elsewhere, but cannot seem to find it: can someone point me to the best way to do term expansions at indexing time. That is, when the sentence is: "This foo is in my way" And I somewhere: foo=bar|yak Lucene indexes something like: "This foo|bar|
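
A minimal sketch of the expansion asked about above, in plain Java with no Lucene dependency. The class `TermExpander` and its `expand` method are hypothetical names used only for illustration; in a real analysis chain this logic would sit in a token filter (the reply in this thread points to SynonymFilter), and the `foo|bar|yak` output format simply mirrors the example in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: appends the configured
// alternatives to each matching token, joined by '|'.
final class TermExpander {

    // Splits the sentence on whitespace and expands every token that has
    // an entry in the expansions map; other tokens pass through unchanged.
    static String expand(String sentence, Map<String, List<String>> expansions) {
        List<String> out = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            List<String> alternatives = expansions.get(token);
            if (alternatives == null || alternatives.isEmpty()) {
                out.add(token);
            } else {
                out.add(token + "|" + String.join("|", alternatives));
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        Map<String, List<String>> rules = new HashMap<>();
        rules.put("foo", Arrays.asList("bar", "yak")); // the rule foo=bar|yak
        System.out.println(expand("This foo is in my way", rules));
    }
}
```

In an actual index the alternatives would normally be emitted as extra tokens at the same position rather than as one joined string; the joined form is kept here only because it is how the question writes it.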

Re: Upgrade Lucene to latest version (4.0) from 2.4.0

2013-01-09 Thread Glen Newton
I am in the process of upgrading LuSql from 2.x to 4.x and I am first going to 3.6 as the jump to 4.x was too big. I would suggest this to you. I think it is less work. Of course I am also able to offer LuSql to 3.6 users, so this is slightly different from your case. -Glen On Wed, Jan 9, 2013 a

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
adding an annotation to text. > > > On 12/13/2012 01:54 PM, Glen Newton wrote: >> >> It is not clear this is exactly what is needed/being discussed. >> >> From the issue: >> "We are also planning a Tokenizer/TokenFilter that can put parts of >> speec

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
It is not clear this is exactly what is needed/being discussed. From the issue: "We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position." This adds it to a token, not a span. 'same position' does no

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
>Unfortunately, Lucene doesn't properly index spans (it records the start position but not the end position), so that limits what kind of matching you can do at search time. If this could be fixed (i.e. indexing the _end_ of a span) I think all the things that I want to do, and the things that can

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread Glen Newton
+10 These are the kind of things you can do in GATE[1] using annotations[2]. A VERY useful feature. -Glen [1]http://gate.ac.uk [2]http://gate.ac.uk/wiki/jape-repository/annotations.html On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D. wrote: >>> Is there any (preliminary) code checked in

Re: Lucene in Corpus Linguistics

2012-09-26 Thread Glen Newton
Yes, very interested. --> Quick scan: very cool work! +10 :-) Thanks, Glen Newton On Wed, Sep 26, 2012 at 9:59 AM, Carsten Schnober wrote: > Hi, > in case someone is interested in an application of the Lucene indexing > engine in the field of corpus linguistics rather than

Re: Performance of storing data in Lucene vs other (No)SQL Databases

2012-05-18 Thread Glen Newton
Storing content in large indexes can significantly add to index time. The model of indexing fields only in Lucene and storing just a key, and then storing the content in some other container (DBMS, NoSql, etc) with the key as lookup is almost a necessity for this use case unless you have a complet

Re: Customizing indexing of large files

2012-02-27 Thread Glen Newton
od is > incrementToken, I have no idea what to do in it. > > Regards, > > Prakash Bande > Director - Hyperworks Enterprise Software > Altair Eng. Inc. > Troy MI > Ph: 248-614-2400 ext 489 > Cell: 248-404-0292 > > -Original Message- > From: Glen Ne

Re: Customizing indexing of large files

2012-02-27 Thread Glen Newton
I'd suggest writing a Perl script or insert-favourite-scripting-language-here script to pre-filter this content out of the files before it gets to Lucene/Solr. Or you could just grep for "Data" and "Description" (or is 'Description' multi-line?) -Glen Newto

Re: Can I detect incorrect language selection after creating an index?

2012-02-27 Thread Glen Newton
Do the check _before_ indexing. Use https://code.google.com/p/language-detection/ to verify the language of the text document before you put it in the index. -Glen Newton http://zzzoot.blogspot.com/ On Mon, Feb 27, 2012 at 10:53 AM, Ilya Zavorin wrote: > Suppose I have a bunch of t

Re: Castle for Lucene/Solr?

2011-09-04 Thread Glen Newton
"Caste" --> Castle https://bitbucket.org/acunu http://support.acunu.com/entries/20216797-castle-build-instructions It looks very promising. It is a kernel module and I'm not sure it can run in user space, which I'd prefer. -Glen Newton On Sat, Sep 3, 2011 at 9:21 PM,

Re: What kind of System Resources are required to index 625 million row table...???

2011-08-15 Thread Glen Newton
43 AIX allows different malloc policies to be used in the underlying system calls. Consider using the WATSON (!) malloc policy. p.134,136 and http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm Finally (or before doing all of

Re: Index one huge text file

2011-07-22 Thread Glen Newton
So, to use Lucene-speak, each sentence is a document. I don't know how you are indexing and what code you are using (and what hardware, etc.), but if you are not already doing so, you should consider multi-threading the indexing, which should give you a significant indexing performance boost. -Glen On Fri

Re: Index one huge text file

2011-07-22 Thread Glen Newton
Could you elaborate what you want to do with the index of large documents? Do you want to search at the document or sentence level? This can drive how to index this content. -Glen On Fri, Jul 22, 2011 at 10:52 AM, starz10de wrote: > Hi, > > I have one text file that contains 60 000 sentences. Is

Re: Lucene Architecture Site (Prototype)

2011-07-07 Thread Glen Newton
gmail interprets the closing asterisk as part of the URL, for all three URLs --> 404s You might want to add a space before the '*'... -glen On Thu, Jul 7, 2011 at 2:17 PM, Abhishek Rakshit wrote: > Hey folks, > > We received great feedback on the Lucene Architecture site that we have been > buil

Re: Lucene on Multi-Processor/Core machines

2011-01-25 Thread Glen Newton
-threaded-query-lucene.html http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html Glen Newton On Tue, Jan 25, 2011 at 11:31 AM, Siraj Haider wrote: > Hello there, > I was looking for best practices for indexing/searching on a > multi-processor/core machine but

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Glen Newton
Where do you get your Lucene/Solr downloads from? [x] ASF Mirrors (linked in our release announcements or via the Lucene website) [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) [] I/we build them from source via an SVN/Git checkout. -Glen Newton

Re: Dataimport performance

2010-12-16 Thread Glen Newton
he Lucene list. If you have any questions, please contact me. Thanks, Glen Newton http://zzzoot.blogspot.com --> Old LuSql benchmarks: http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html On Thu, Dec 16, 2010 at 12:04 PM, Dyer, James wrote: > We have ~50 lon

IndexTank technology...

2010-11-11 Thread Glen Newton
Does anyone know what technology they are using: http://www.indextank.com/ Is it Lucene under the hood? Thanks, and apologies for cross-posting. -Glen http://zzzoot.blogspot.com -- - - To unsubscribe, e-mail: java-user-unsubs

Re: lucene usage on TREC data

2010-08-14 Thread Glen Newton
the ClueWeb collection http://trec.nist.gov/pubs/trec18/papers/arsc.WEB.pdf Expanding Queries Using Multiple Resources http://staff.science.uva.nl/~mdr/Publications/Files/trec2006-proceedings-genomics.pdf -Glen Newton http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html http

Re: Using categories with Lucene

2010-08-09 Thread Glen Newton
Hi Luan, Could you tell us the name and/or URL of this plugin so that the list might know about it? Thanks, Glen On 10 August 2010 12:21, Luan Cestari wrote: > > We would like to say thanks for the replies. > > We found a plugin in Nutch (the Creative Commons plugin) that does like Otis > said.

Re: Databases

2010-07-22 Thread Glen Newton
, in a Solr context. http://wiki.apache.org/solr/DataImportHandler Thanks, -Glen Newton LuSql author http://zzzoot.blogspot.com/ On 23 July 2010 15:46, manjula wijewickrema wrote: > Hi, > > Normally, when I am building my index directory for indexed documents, I > used to keep my i

Re: Best practices for searcher memory usage?

2010-07-14 Thread Glen Newton
There are a number of strategies, on the Java or OS side of things: - Use huge pages[1]. Esp on 64 bit and lots of ram. For long running, large memory (and GC busy) applications, this has achieved significant improvements. Like 300% on EJBs. See [2],[3],[4]. For a great article introducing and benc

Re: If you could have one feature in Lucene...

2010-02-27 Thread Glen Newton
Hello Uwe. That will teach me for not keeping up with the versions! :-) So it is up to the application to keep track of what it used for compression. Understandable. Thanks! Glen On 27 February 2010 10:17, Uwe Schindler wrote: > Hi Glen, > > >> Pluggable compression allowing for alternatives to

Re: If you could have one feature in Lucene...

2010-02-27 Thread Glen Newton
Pluggable compression allowing for alternatives to gzip for text compression for storing. Specifically I am interested in bzip2[1] as implemented in Apache Commons Compress[2]. While bzip2 compression is considerably slower than gzip (although decompression is not too much slower than gzip) it comp

Re: Exception while adding document in 3.0

2010-02-02 Thread Glen Newton
Documents cannot be re-used in v3.0? http://wiki.apache.org/lucene-java/ImproveIndexingSpeed -glen http://zzzoot.blogspot.com/ On 2 February 2010 02:55, Simon Willnauer wrote: > Ganesh, > > do you reuse your Document instances in any way or do you create new > docs for each add? > > simon > > O

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-18 Thread Glen Newton
en looking at their index with > Luke. :) >  Otis > -- > Sematext is hiring -- http://sematext.com/about/jobs.html?mls > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR > > > > ----- Original Message >> From: Glen Newton >> To: java-user@

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Glen Newton
s.apache.org/jira/browse/LUCENE-652 > https://issues.apache.org/jira/browse/LUCENE-1960 > > Glen Newton wrote: >> Could someone send me where the rationale for the removal of >> COMPRESSED fields is? I've looked at >> http://people.apache.org/~uschindler/staging-area/luce

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Glen Newton
Could someone send me where the rationale for the removal of COMPRESSED fields is? I've looked at http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-rc1/changes/Changes.html#3.0.0.changes_in_runtime_behavior but it is a little light on the 'why' of this change. My fault - of course - f

Re: Lucene index write performance optimization

2009-11-10 Thread Glen Newton
You might try re-implementing, using ThreadPoolExecutor http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html glen 2009/11/10 Jamie Band : > Hi There > > Our app spends alot of time waiting for Lucene to finish writing to the > index. I'd like to minimize this. If y

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Glen Newton
This is basically what LuSql does. The time improvements ("8h to 30 min") are similar, usually about an order of magnitude. Oh, the comments suggesting most of the interaction is with the database? The answer is: it depends. With large Lucene documents: Lucene is the limiting factor (worsen

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Glen Newton
LuSql Disclosure: I am the author of LuSql. -Glen Newton http://zzzoot.blogspot.com/ http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/Glen_Newton 2009/10/22 Paul Taylor : > I'm building a lucene index from a database, creating 1 about 1 million > documents, unsuprisingly

Re: Field with reader limitation arbitrary

2009-09-15 Thread Glen Newton
I appreciate your explanation, but I think that the use case I described merits a deeper exploration: Scenario 1: 16 threads indexing; queue size = 1000; present api; need to store In this scenario, there are always 1000 Strings with all the contents of their respective files. Averaging 50k per do

Re: Field with reader limitation arbitrary

2009-09-15 Thread Glen Newton
h and/or tests if > you have them. > > Cheers, > Anthony > > On Mon, Sep 14, 2009 at 1:03 PM, Glen Newton wrote: >> Hi, >> >> In 2.4.1, Field has 2 constructors that involve a Reader: >> public Field(String name, >>                  Reader

Field with reader limitation arbitrary

2009-09-14 Thread Glen Newton
tring name, Reader reader, Field.Store store, Field.Index index, Field.TermVector termVector) Constructively, Glen Newton http://zzzoot.blogspot.com/ http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index

Re: Indexing large files? - No answers yet...

2009-09-11 Thread Glen Newton
..@lucene.apache.org > [mailto:java-user-return-42272-paul_murdoch=emainc@lucene.apache.org] On > Behalf Of Glen Newton > Sent: Friday, September 11, 2009 9:53 AM > To: java-user@lucene.apache.org > Subject: Re: Indexing large files? - No answers yet... > > In this

Re: Indexing large files? - No answers yet...

2009-09-11 Thread Glen Newton
In this project: http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html I concatenate all the text of all of the articles of a single journal into a single text file. This can create a text file that is 500MB in size. Lucene is OK at indexing files this size (in parallel even),

Re: [EASY]How to change the demo of lucene143 into a multithread one?

2009-08-13 Thread Glen Newton
You are optimizing before the threads are finished adding to the index. I think this should work: IndexWriter writer = new IndexWriter("D:\\index", new StandardAnalyzer(), true); File file=new File(args[0]); Thread t1=new Thread(new IndexFiles(writer,file)); Thread t2=new Thread(new IndexFiles(wri
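
A minimal sketch of the ordering issue named above: the final optimize step should run only after every indexing thread has finished. `CountingWriter` is a hypothetical stand-in, not Lucene's IndexWriter, and `JoinBeforeOptimize` is an illustrative name; the point is only the `join()` calls placed before the final step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an index writer; NOT Lucene's IndexWriter.
final class CountingWriter {
    private final AtomicInteger added = new AtomicInteger();
    private volatile int countAtOptimize = -1;

    void addDocument(String doc) {
        added.incrementAndGet();
    }

    // Records how many documents had been added when optimize ran.
    void optimize() {
        countAtOptimize = added.get();
    }

    int countAtOptimize() {
        return countAtOptimize;
    }
}

final class JoinBeforeOptimize {

    // Starts the workers, waits for all of them, then optimizes.
    // Returns the number of documents the optimize step observed.
    static int run(int threads, int docsPerThread) {
        CountingWriter writer = new CountingWriter();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < docsPerThread; i++) {
                    writer.addDocument("doc-" + id + "-" + i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // block until this worker has finished adding
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        writer.optimize(); // no thread is still adding at this point
        return writer.countAtOptimize();
    }

    public static void main(String[] args) {
        System.out.println(run(2, 1000));
    }
}
```

Without the `join()` loop, the optimize step could run while workers are still adding, which is the situation the reply describes.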

Visualizing Semantic Journal Space (large scale) using full-text

2009-07-29 Thread Glen Newton
tion using only the full-text (no metadata). For more info & howto: http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html Glen Newton

Re: New tool: LSql

2009-04-14 Thread Glen Newton
e you include Lucene v2.3 in your > code...does it work correctly with indexes created on v2.4 as well? > - Greg > > > On Mon, Apr 13, 2009 at 6:49 PM, Glen Newton wrote: > >> As the creator of LuSql >> [http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

Re: New tool: LSql

2009-04-13 Thread Glen Newton
As the creator of LuSql [http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql] I would have hoped for a more creative (and more different) name. :-) -glen 2009/4/13 jonathan esposito : > I created a command-line tool in Java that allows the user to execute > sql-like commands again

Re: Can I run Lucene in google app engine?

2009-04-13 Thread Glen Newton
Another solution is to have your application on the AppEngine, but the index is on another machine. Then the application 'proxies' the requests to the machine that has the index, which is using Solr [http://lucene.apache.org/solr/] or some other way to expose the index to the web. Yes, this mea

Re: LuSQL download link error?

2009-04-02 Thread Glen Newton
Dear Shashi, It should work now. A temporary failure: our apologies. thanks, Glen 2009/4/2 Shashi Kant : > Hi all, I have been trying to get the latest version of LuSQL from the > NRC.ca website but get 404s on the download links. I have written to the > webmaster, but anyone have the jar handy

Re: "People you might know" ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Glen Newton
You might try looking in a list that talks about recommender systems. Google hits: - http://en.wikipedia.org/wiki/Recommendation_system - ACM Recommender Systems 2009 http://recsys.acm.org/ - A Guide to Recommender Systems http://www.readwriteweb.com/archives/recommender_systems.php 2009/3/17 Aaro

Re: public apology for company spam

2009-03-05 Thread Glen Newton
and your colleagues do not have infinite social capital, and hopefully you will have no reason to be forced to spend this capital in such an unfortunate manner in the future. :-) sincerely, Glen Newton 2009/3/5 Yonik Seeley : > This morning, an apparently over-zealous marketing firm, on behalf

Re: Merging database index with fulltext index

2009-03-01 Thread Glen Newton
I would suggest you try LuSql, which was designed specifically to index relational databases into Lucene. It has an extensive user manual/tutorial which has some complex examples involving multi-joins and sub-queries. I am the author of LuSql. LuSql home page: http://lab.cisti-icist.nrc-cnrc.gc.c

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-19 Thread Glen Newton
onventional > processors execute an idle loop when there is no work to do, so > CPI may be artificially low, especially when the system is > somewhat idle. The UltraSPARC T1 and T2 "park" idle threads, > consuming no energy, when there is no work to do, so CPI may > be arti

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread Glen Newton
Could you give some configuration details: - Solaris version - Java VM version, heap size, and any other flags - disk setup You should also consider using huge pages (see http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html) I will also be posting performance gains using

Re: Visualization

2009-02-12 Thread Glen Newton
V1 of a project of mine, Ungava[1], which uses Lucene to index research articles and library catalog metadata, also uses Project Simile's Metaphor and Timeline. I have some simple examples using them: Here is the search for "cell" in articles: http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava/Search?

Re: [ANN] Lucid Imagination

2009-01-26 Thread Glen Newton
Congrats & good-luck on this new endeavour! -Glen :-) 2009/1/26 Grant Ingersoll : > Hi Lucene and Solr users, > > As some of you may know, Yonik, Erik, Sami, Mark and I teamed up with > Marc Krellenstein to create a company to provide commercial > support (with SLAs), training, value-add compone

Re: clustering with compass & terracotta

2009-01-15 Thread Glen Newton
There is a discussion here: http://www.terracotta.org/web/display/orgsite/Lucene+Integration Also of interest: "Katta - distribute lucene indexes in a grid" http://katta.wiki.sourceforge.net/ -glen http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html http://zzzoot.blo

Re: Help with installing Lucene

2009-01-07 Thread Glen Newton
> I'm not sure if it's a better idea to use something like Solr or start from > scratch and customize the application as I move forward. What do you think LuSql might be appropriate for your needs: "LuSql is a high-performance, simple tool for indexing data held in a DBMS into a Lucene index. It c

Re: FastSSFuzzy for faster fuzzy queries in Lucene

2009-01-06 Thread Glen Newton
- Fast Similarity Search in Large Dictionaries. http://fastss.csg.uzh.ch/ - Paper: Fast Similarity Search in Large Dictionaries. http://fastss.csg.uzh.ch/ifi-2007.02.pdf - FastSimilarSearch.java http://fastss.csg.uzh.ch/FastSimilarSearch.java - Paper: Fast Similarity Search in Peer-to-Peer Networks

Re: Taxonomy in Lucene

2008-12-10 Thread Glen Newton
Oops. Thanks! :-) 2008/12/10 Gary Moore <[EMAIL PROTECTED]>: > svn co https://bobo-browse.svn.sourceforge.net/svnroot/bobo-browse/trunk > bobo-browse > -Gary > Glen Newton wrote: >> >> I don't think this is an Open Source project: I couldn't find any >

Re: Taxonomy in Lucene

2008-12-10 Thread Glen Newton
I don't think this is an Open Source project: I couldn't find any source on the site and the only download is a jar with .class files... -glen 2008/12/10 John Wang <[EMAIL PROTECTED]>: > www.browseengine.com > -John > > On Wed, Dec 10, 2008 at 10:55 AM, Glen Newt

Re: Taxonomy in Lucene

2008-12-10 Thread Glen Newton
From what I understand: faceted browse is a taxonomy of depth = 1. A taxonomy in general has an arbitrary depth. Example: Biological taxonomy: Kingdom Animalia Phylum Acanthocephala Class Archiacanthocephala Phylum Annelida Kingdom Fungi Phylum Ascomycota Class Ascomycetes

Re: NIOFSDirectory

2008-12-05 Thread Glen Newton
oblems, generally you don't > want concurrent writes. > > -John > > On Thu, Dec 4, 2008 at 2:44 PM, Glen Newton <[EMAIL PROTECTED]> wrote: > >> Am I missing something here? >> >> Why not use: >> IndexWriter writer = new IndexWriter(NIOFSDi

Re: NIOFSDirectory

2008-12-04 Thread Glen Newton
s more on how to use NIOFSDirectory class. I am hoping for a simply >> answer, >> > is what I am doing (setting the class name statically on system property) >> > the right way? >> > >> > -John >> > >> > On Thu, Dec 4, 2008

Re: NIOFSDirectory

2008-12-04 Thread Glen Newton
Sorry, what version are we talking about? :-) thanks, Glen 2008/12/4 Yonik Seeley <[EMAIL PROTECTED]>: > On Thu, Dec 4, 2008 at 4:11 PM, John Wang <[EMAIL PROTECTED]> wrote: >> Hi guys: >>We did some profiling and benchmarking: >> >>The thread contention on FSDIrectory is gone, and fo

Re: lucene nicking my memory ?

2008-12-03 Thread Glen Newton
Hi Magnus, Could you post the OS, version, RAM size, swapsize, Java VM version, hardware, #cores, VM command line parameters, etc? This can be very relevant. Have you tried other garbage collectors and/or tuning as described in http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html?

Merging indexes & multicore/multithreading

2008-12-02 Thread Glen Newton
Let's say I have 8 indexes on a 4 core system and I want to merge them (inside a single vm instance). Is it better to do a single merge of all 8, or to merge in pairs in parallel threads until there is only a single index left? I guess the question involves how multi-threaded merging is and if it

Lucene 2.3.1 vs 2.4 benchmarks using LuSql

2008-11-24 Thread Glen Newton
I have some simple indexing benchmarks comparing Lucene 2.3.1 with 2.4: http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html In the next couple of days I will be running benchmarks comparing Solr's DataImportHandler/JdbcDataSource indexing performance with LuSql and wil

Software Announcement: LuSql: Database to Lucene indexing

2008-11-17 Thread Glen Newton
g an 86GB Lucene index in ~13 hours. http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql Glen Newton

Re: "Global" Field question (thread-safe)?

2008-11-06 Thread Glen Newton
Thanks! :-) 2008/11/6 Michael McCandless <[EMAIL PROTECTED]>: > > The field never changes across all docs? If so, this will work fine. > > Mike > > Glen Newton wrote: > >> I have a use case where I want all of my documents to have - in >> addition to the

"Global" Field question (thread-safe)?

2008-11-06 Thread Glen Newton
I have a use case where I want all of my documents to have - in addition to their other fields - a single field=value. An example use is where I have multiple Lucene indexes that I search in parallel, but still need to distinguish them. Index 1: All documents have: source="a1" Index 2: All documen

Re: Document thread safe?

2008-10-31 Thread Glen Newton
Yes, the problem goes away when I do the following: synchronized(doc) { doc.add(field); } Thanks. [I'll use a Lock to do this properly] -glen 2008/10/31 Yonik Seeley <[EMAIL PROTECTED]>: > On Fri, Oct 31, 2008 at 11:53 AM, Glen Newton <[EMAIL PROTECTED]> wrote: >>
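
A minimal sketch of the workaround quoted above, using a hypothetical `SimpleDocument` stand-in (an unsynchronized list of fields) rather than Lucene's Document. The class and method names are assumptions for illustration; the only part taken from the message is the `synchronized(doc) { doc.add(field); }` pattern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document whose add() is not thread-safe.
final class SimpleDocument {
    private final List<String> fields = new ArrayList<>();

    void add(String field) {
        fields.add(field);
    }

    int size() {
        return fields.size();
    }
}

final class SynchronizedAdds {

    // Several threads add fields to one shared document, each add guarded
    // by the document's monitor. Returns the final number of fields.
    static int addConcurrently(int threads, int fieldsPerThread) {
        SimpleDocument doc = new SimpleDocument();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < fieldsPerThread; i++) {
                    synchronized (doc) { // one thread mutates the document at a time
                        doc.add("field-" + i);
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return doc.size(); // safe to read: all workers have been joined
    }

    public static void main(String[] args) {
        System.out.println(addConcurrently(4, 5000));
    }
}
```

A `java.util.concurrent.locks.ReentrantLock` held around the same `add` call would serve the same purpose, which is presumably what the "use a Lock" remark refers to.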

Document thread safe?

2008-10-31 Thread Glen Newton
Hello, I am using Lucene 2.3.1. I have concurrent threads adding Fields to the same Document, but getting some odd behaviour. Before going into too much depth, is Document thread-safe? thanks, Glen http://zzzoot.blogspot.com/

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
2008/10/23 Michael McCandless <[EMAIL PROTECTED]>: > > Mark Miller wrote: > >> Glen Newton wrote: >>> >>> 2008/10/23 Mark Miller <[EMAIL PROTECTED]>: >>> >>>> It sounds like you might have some thread synchronization issues outside

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
2008/10/23 Mark Miller <[EMAIL PROTECTED]>: > It sounds like you might have some thread synchronization issues outside of > Lucene. To simplify things a bit, you might try just using one IndexWriter. > If I remember right, the IndexWriter is now pretty efficient, and there > isn't much need to inde

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
You might want to look at my indexing of 6.4 million PDF articles, full-text and metadata. It resulted in an 83GB index taking 20.5 hours to run. It uses multiple writers, is massively multithreaded. More info here: http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html Che

Re: Link map over results? or term freq

2008-10-16 Thread Glen Newton
See also: http://zzzoot.blogspot.com/2007/10/drill-clouds-for-search-refinement-id.html and http://zzzoot.blogspot.com/2007/10/tag-cloud-inspired-html-select-lists.html -glen 2008/10/16 Glen Newton <[EMAIL PROTECTED]>: > Yes, tag clouds. > > I've implemented them using

Re: Link map over results? or term freq

2008-10-16 Thread Glen Newton
lts they got back. Sort of > like latent relationships. > > Does that help? > > I thought this could be done using term frequency vectors in Lucene, but > I've never used TFV's before. And can then be limited to just a set of > results. > > HTH, > D

Re: Link map over results? or term freq

2008-10-16 Thread Glen Newton
Sorry, could you explain what you mean by a "link map over lucene results"? thanks, -glen 2008/10/16 Darren Govoni <[EMAIL PROTECTED]>: > Hi, > Has anyone created a link map over lucene results or know of a link > describing the process? If not, I would like to build one to contribute. > > Also,

Re: Indexing Scalability, Multiwriter?

2008-10-10 Thread Glen Newton
IndexWriter is thread-safe and has been for a while (http://www.mail-archive.com/[EMAIL PROTECTED]/msg00157.html) so you don't have to worry about that. As reported in my blog in April (http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html) but perhaps not explicitly enoug

Re: could I implement this scenario?

2008-09-19 Thread Glen Newton
> I think it is not good idea to use lucene as storage, it is just index. I strongly disagree with this position. To qualify my disagreement: yes, you should not use Lucene as your primary storage for your data in your organization. But, for a particular application, taking content from your pri

Re: Tree search

2008-08-07 Thread Glen Newton
There are a number of ways to do this. Here is one: Lose the parentid field (unless you have other reasons to keep it). Add a field fullName, and a field called depth : doc1 fullName: state depth: 0 doc2 fullName: state/department depth:1 doc3 fullName: state/department/Boston depth: 2 doc4 ful
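
A minimal sketch of how the two suggested field values could be derived and used, in plain Java with no Lucene calls. `PathFields` and its methods are hypothetical names; the field names `fullName` and `depth` come from the message, while the direct-child check (path prefix plus depth + 1) is an assumption about how the fields would be queried, since the message is cut off here.

```java
// Hypothetical helper: computes values for the fullName/depth fields from a
// slash-separated path. Nothing here is Lucene API.
final class PathFields {

    // Depth is the number of '/' separators: "state" -> 0, "state/department" -> 1.
    static int depth(String fullName) {
        int depth = 0;
        for (int i = 0; i < fullName.length(); i++) {
            if (fullName.charAt(i) == '/') {
                depth++;
            }
        }
        return depth;
    }

    // True when candidate sits exactly one level below parent in the tree.
    static boolean isDirectChild(String parent, String candidate) {
        return candidate.startsWith(parent + "/")
                && depth(candidate) == depth(parent) + 1;
    }

    public static void main(String[] args) {
        String[] docs = {"state", "state/department", "state/department/Boston"};
        for (String fullName : docs) {
            System.out.println(fullName + " depth=" + depth(fullName));
        }
    }
}
```

In an index, the same check would presumably be expressed as a prefix match on `fullName` combined with an exact match on `depth`.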

Re: Scaling

2008-07-16 Thread Glen Newton
A subset of your questions are answered (or at least examined) in my postings on multi-thread queries on a multiple-core single system: http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html -Glen 200

Re: How to make documents clustering and topic classification with lucene

2008-07-08 Thread Glen Newton
Use Carrot2: http://project.carrot2.org/ For Lucene + Carrot2: http://project.carrot2.org/faq.html#lucene-integration -glen 2008/7/7 Ariel <[EMAIL PROTECTED]>: > Hi everybody: > Do you have Idea how to make how to make documents clustering and topic > classification using lucene ??? Is there a

Re: Concurrent query benchmarks, with 1,2,4,8 readers

2008-06-13 Thread Glen Newton
Lutan, Yes, no problem. I am away at a conference next week but plan to release the code the following week. Is this OK for you? thanks, Glen 2008/6/13 lutan <[EMAIL PROTECTED]>: > > TO: Glen Newton Could I get your test code or code architecture for study. > I ha

Re: Concurrent query benchmarks, with 1,2,4,8 readers

2008-06-11 Thread Glen Newton
en the performance will slowly deteriorate with more > readers/searchers let's see it! I'm running it & will post when it is done. thanks, Glen :-) > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > - Original Messag

Concurrent query benchmarks, with 1,2,4,8 readers

2008-06-11 Thread Glen Newton
I have extended my evaluation (previous evaluation: http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html) to include, in addition to an increasing number of threads performing concurrent queries, 1, 2, 4 and 8 IndexReaders. The results can be found here: http://zzzoot.blogspot.com/2008/0

Re: Concurrent query benchmarks

2008-06-10 Thread Glen Newton
Lucene Database Search in 3 minutes: > http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes > DBSight customer, a shopping comparison site, (anonymous per request) got > 2.6 Million Euro funding! > > On Mon, Jun 9, 2008 at 3:51 PM, Glen Newton <[EMAIL PROT

Re: Concurrent query benchmarks

2008-06-10 Thread Glen Newton
I have, with the gnuplot scripts that I have. Let me finish off what I am doing for my work and I will clean things up a bit, write a little documentation. -Glen > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > - Original Message >&g

Concurrent query benchmarks

2008-06-09 Thread Glen Newton
A number of people have asked about query benchmarks. I have posted benchmarks for concurrent query requests for Lucene 2.3.1 on my blog, where I look at 1 - 4096 concurrent requests: http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html I hope you find this useful. thanks

Re: Multi-language support within a single index

2008-06-05 Thread Glen Newton
want. And it > works for both indexing and querying out-of-the-box. > > Best > Erick > > On Thu, Jun 5, 2008 at 12:14 PM, Glen Newton <[EMAIL PROTECTED]> wrote: > >> I would like to be able to get multi-language support within a single >> index. >>

Multi-language support within a single index

2008-06-05 Thread Glen Newton
d to make these sorts of manipulations to the nature of the segments files easier for mere mortal developers? :-) Is this something that is already being talked about/looked in to/being implemented? :-) thanks, Glen Newton http://zzzoot.blogspot.com/ --
