Re: Crawler

2009-01-30 Thread Michael Wechner
Jay Malaluan wrote: Hi, You can check out Nutch at http://lucene.apache.org/nutch/. Also see http://incubator.apache.org/projects/droids.html Cheers Michael Regards, Jay Joel Malaluan Haroldo Nascimento-2 wrote: Hi, Is there any crawler that integrates with a Lucene index?

Re: Lucene OpenCms search - Xpath notation?

2009-01-30 Thread Michael Wechner
Kesarkar, Dipak wrote: Hi, I am using OpenCms 7.0.5 with Lucene search engine. I need to index XML content for which I have the following field configuration in the opencms-search.xml Unfortunately I don't have any knowledge re OpenCMS, but I think you rather want to ask there (or hav

Re: indexing binary files?

2009-01-30 Thread Michael McCandless
You can also create a Lucene field using a Reader, if the String is really too large to materialize at once. Such fields cannot be stored though. But, if the String really is so large, I would worry about the end user's experience (normally you want a Document to be a rather bite-sized

Re: indexing binary files?

2009-01-30 Thread Paul Feuer
The ~25 GB represents about 100 million events, at an average of about 250 bytes each. The indexed and searchable values are normal things: small bits of text (8-10 bytes usually); longs; ints; etc... Also this 25 GB is a per-day size, which is why expanding the values in it to ASCII is problematic fro

Re: indexing binary files?

2009-01-30 Thread Yonik Seeley
Yes, that should work. Stream the file, converting each record to a Lucene Document. All of the fields should probably be indexed only (not stored) for size reasons, and then you could have a single stored but not indexed field that would be the offset into your binary file. -Yonik On Fri, Jan
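
A minimal sketch of the offset bookkeeping described above, kept free of Lucene so it runs on its own. The record layout (a 4-byte big-endian length followed by the payload) and the names `RecordOffsetScanner`, `scanOffsets` and `sample` are assumptions made for illustration; the thread does not describe the actual binary format.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical record layout (not from the thread): a 4-byte big-endian
// length, then that many payload bytes.
final class RecordOffsetScanner {

    /** Streams the input once and returns the byte offset at which each record starts. */
    static List<Long> scanOffsets(InputStream in) {
        List<Long> offsets = new ArrayList<>();
        DataInputStream data = new DataInputStream(in);
        long position = 0;
        try {
            while (true) {
                int length;
                try {
                    length = data.readInt();
                } catch (EOFException noMoreHeaders) {
                    break; // no further complete length header: stop scanning
                }
                byte[] payload = new byte[length];
                data.readFully(payload); // a truncated record surfaces as an exception
                // An indexer would parse `payload` here, add its values as
                // indexed-only fields, and keep `position` as the one stored value.
                offsets.add(position);
                position += 4L + length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offsets;
    }

    /** Builds a small in-memory sample file with one record per payload. */
    static byte[] sample(byte[]... payloads) {
        int total = 0;
        for (byte[] payload : payloads) {
            total += 4 + payload.length;
        }
        ByteBuffer buffer = ByteBuffer.allocate(total); // big-endian by default
        for (byte[] payload : payloads) {
            buffer.putInt(payload.length).put(payload);
        }
        return buffer.array();
    }
}
```

At search time the stored offset would be used to seek into the original binary file and read the record back, so the index never has to hold a copy of the data.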

Re: indexing binary files?

2009-01-30 Thread Shashi Kant
Hi Paul, have you tried persisting the binaries in Base64 format and then indexing them? As you are aware, Base64 is a robust representation used in email attachments for example. Thanks Shashi - Original Message From: Paul Feuer To: java-user@lucene.apache.org Sent: Thursday, Janu

Best Practice for Lucene Search

2009-01-30 Thread ilwes
Hello, I googled, searched this forum and read the manual, but I'm not sure what would be the best practice for Lucene search. I have an e-Commerce application with about 10 MySQL tables for my products. And I have an Index (which is working fine), with about 10 fields for every product. Is it a

Re: indexing binary files?

2009-01-30 Thread Paul Feuer
The binary events in the file are parsable by both our Java server-side processes and the clients of these processes, so we need to keep the data in the binary format. ./paul Sent from my Verizon Wireless BlackBerry -Original Message- From: Shashi Kant Date: Fri, 30 Jan 2009 06:3

Re: Best Practice for Lucene Search

2009-01-30 Thread Nilesh Thatte
Hello, I would store normalised data in MySQL and index only searchable content in Lucene. Regards Nilesh From: ilwes To: java-user@lucene.apache.org Sent: Friday, 30 January, 2009 15:08:10 Subject: Best Practice for Lucene Search Hello, I googled, sea

Re: Best Practice for Lucene Search

2009-01-30 Thread Ian Lea
That answer is fine, but there are others. We store denormalized data in Lucene, as you are doing, for display on web pages because we can get it out of Lucene much faster than we can get it out of the various tables in the database. The database is not as fast as it might be, quite possibly slow

RE: Best Practice for Lucene Search

2009-01-30 Thread Uwe Schindler
We do it in the same way. We use our RDBMS to administer our metadata/data. The search frontend for end users works completely with Lucene/panFMP (www.pangaea.de). We marshal all our relational data to XML files and index their contents using Lucene. But the XML file is also stored in Lucene as s

Re: Best Practice for Lucene Search

2009-01-30 Thread Erick Erickson
Do you have a reasonable expectation that performance is going to be a problem? The reason I ask is that I'm always suspicious of efficiency arguments when "things are working fine". Unless and until you can confidently predict that you're going to hit a performance issue, do it the easiest way pos

Re: indexing binary files?

2009-01-30 Thread Shashi Kant
Unless I am missing something, not sure I see the issue here. You can convert to Base64 purely for indexing purposes and leave the original binary as-is. - Original Message From: Paul Feuer To: Lucene User List ; Shashi Kant Sent: Friday, January 30, 2009 10:12:33 AM Subject: Re: in

Re: indexing binary files?

2009-01-30 Thread Paul Feuer
Expanding 25+ GB per day is not ideal. If it's possible to index the binary directly, as it sounds like it might be, we'll just do that. I think what I was missing was - I didn't see AbstractField which seems like it has the stuff I need (if indeed Field is used as I assume it is) ./paul Sent
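
For a sense of scale, a rough sketch of the Base64 expansion under discussion, using the figures quoted earlier in the thread (about 100 million events per day at about 250 bytes each). It assumes each event would be encoded separately with standard padding; the class and method names are made up for the illustration.

```java
import java.util.Base64;

final class Base64Expansion {

    /** Length in bytes of the padded Base64 encoding of rawLength input bytes. */
    static long encodedLength(long rawLength) {
        return 4 * ((rawLength + 2) / 3); // every 3 input bytes become 4 output characters
    }

    public static void main(String[] args) {
        int eventSize = 250;               // average event size quoted in the thread
        long eventsPerDay = 100_000_000L;  // daily event count quoted in the thread

        // Cross-check the formula against the JDK encoder for one event.
        int actual = Base64.getEncoder().encode(new byte[eventSize]).length;
        long perEvent = encodedLength(eventSize);

        System.out.println("encoded bytes per event: " + actual + " (formula: " + perEvent + ")");
        System.out.println("raw bytes per day:       " + eventSize * eventsPerDay);
        System.out.println("Base64 bytes per day:    " + perEvent * eventsPerDay);
    }
}
```

Under these assumptions a 250-byte event becomes 336 characters, so 25 GB of raw events would grow to about 33.6 GB per day before any indexing overhead.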

RE: indexing binary files?

2009-01-30 Thread Uwe Schindler
Hi Shashi, What is the sense of this? The base64 encoded documents cannot be tokenized and searched. To do this, they must be indexed as plain text. If you want to store the original binary values as document data in the index, you could also store them additionally as byte[] in the raw binary form

Re: indexing binary files?

2009-01-30 Thread Shashi Kant
Hi Uwe, I was suggesting writing a custom tokenizer. In the worst case it would be a character per token, might not be a very pretty solution, but should do the job. What do you think? Thanks Shashi On Fri, Jan 30, 2009 at 12:57 PM, Uwe Schindler wrote: > Hi Shashi, > > What is the sense of th

Concurrent IndexReader and IndexSearcher behavior

2009-01-30 Thread Kay Kay
Assume I have an index of size 20G and a main memory of 1G. I do the following steps in order. * Open an IndexSearcher on the directory. * Serve Searches from that directory Meanwhile (when the IndexSearcher is still open on the directory) - the following operations are performed concurrently.

RE: Concurrent IndexReader and IndexSearcher behavior

2009-01-30 Thread Uwe Schindler
As long as the IndexReader of the searcher is not reopened, the searcher will not see any changes and will therefore not crash. This works because all changes are written to an extra file (.del for deleted docs). The concurrent reader will not see those changes. If the index is then optimized in paralle

Re: querying English conjugation of verbs and comparative and superlative of adjectives

2009-01-30 Thread Koji Sekiguchi
I've tried o.a.s.a.EnglishPorterFilterFactory, which creates org.tartarus.snowball.ext.EnglishStemmer, but didn't get any success... I'd like to search "went" and "gone" when I query "go". Thank you, Koji Erick Erickson wrote: If thou wast to investigate the stemmers would that work? I conf

Re: querying English conjugation of verbs and comparative and superlative of adjectives

2009-01-30 Thread Erick Erickson
I don't expect this to work at all. Stemmers apply heuristics to try to fold words into their stem. They are notoriously incapable of handling irregular forms of a word. You'd need to look more at a synonym list for words like your example. Best Erick On Fri, Jan 30, 2009 at 7:25 PM, Koji Sekiguc
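
A small sketch of the synonym-list idea, independent of any Lucene or Solr class. The word lists and the `IrregularForms` name are hypothetical; a real list would come from a lexicon or a hand-maintained synonym file.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class IrregularForms {

    // Hypothetical hand-built table mapping a base form to its irregular variants.
    private static final Map<String, List<String>> FORMS = new HashMap<>();
    static {
        FORMS.put("go", Arrays.asList("go", "goes", "going", "went", "gone"));
        FORMS.put("good", Arrays.asList("good", "better", "best"));
    }

    /** Returns every listed form for a query term, or just the term if it is not listed. */
    static List<String> expand(String term) {
        List<String> forms = FORMS.get(term.toLowerCase(Locale.ROOT));
        return forms != null ? forms : Collections.singletonList(term);
    }

    public static void main(String[] args) {
        System.out.println(expand("go"));
        System.out.println(expand("walk"));
    }
}
```

The expanded terms could be OR-ed together at query time, or the same table could be applied at index time by a synonym filter, while regular inflections are still left to the stemmer.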