RE: Italian web sites
Sniff the IP and then, using the database at the internet topology website http://netgeo.caida.org/perl/netgeo.cgi, you can find the country of origin (use that to populate your own DB, so retrieval from the remote database decreases as you accumulate IPs). But that gives you websites hosted in Italy, not necessarily Italian-language websites. Unfortunately, unless Italian pages use a different encoding, picking it up from the page (JavaScript) won't help much.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 24, 2002 1:03 PM
To: [EMAIL PROTECTED]
Subject: Italian web sites

Hi all,
I'm using JoBo for spidering web sites and Lucene for indexing. The problem is that I'd like to spider only Italian web sites. How can I discover the country of a web site? Do you know of a method that you can suggest?
Thanks
Laura

--
To unsubscribe, e-mail: mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]
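The suggestion above to populate your own DB so that remote lookups decrease can be sketched roughly as follows. This is only an illustration: the class name `CountryCache` and the `remoteLookup` function are hypothetical, and how the remote database is actually queried and parsed is not shown in the thread, so it is left to the caller.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Caches IP-to-country answers so each address is looked up remotely at most once. */
class CountryCache {
    private final Map<String, String> cache = new HashMap<>();
    // Hypothetical: wraps whatever query is made against the remote database.
    private final Function<String, String> remoteLookup;

    CountryCache(Function<String, String> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    /** Returns the country for an IP, calling the remote lookup only on a cache miss. */
    String countryOf(String ip) {
        return cache.computeIfAbsent(ip, remoteLookup);
    }

    /** Number of distinct IPs answered so far. */
    int size() {
        return cache.size();
    }
}
```

A crawler could call `countryOf` for every host it resolves and skip hosts whose answer is not Italy. As the reply notes, this only tells you where a site is hosted, not what language it is written in.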
Re: Re:_HTML_parser
Otis,

What's the final conclusion you've arrived at regarding the HTML filter/parsing? I have pretty much the same requirements as you do right now (extract text, and obtain the title).

Kelvin

- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, April 22, 2002 12:27 AM
Subject: Re:_HTML_parser

Laura,

http://marc.theaimsgroup.com/?l=lucene-user&w=2&r=1&s=Spindle&q=b

Oops, it's JoBo, not MoJo :)
http://www.matuschek.net/software/jobo/

Otis

--- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

Hi Otis,
thanks for your reply. I have been looking for Spindle and MoJo for 2 hours but I haven't found anything. Can you help me? Where can I find something?
Thanks for your help and time
Laura

Laura,
Search the lucene-user and lucene-dev archives for things like: crawler, spider, spindle, lucene sandbox. Spindle is something you may want to look at, as is MoJo (not mentioned on Lucene lists, use Google).
Otis

Did someone solve the problem of spidering web pages recursively?

While trying to research the same thing, I found the following... here's a good example of link extraction.

Try http://www.quiotix.com/opensource/html-parser
It's easy to write a Visitor which extracts the links; it should take about ten lines of code.
Re: Italian web sites
Hi all,
I have found a very interesting library which is written in Perl. The problem now is how I can use this library. Anyway, the library is TextCat, and you can find it at: http://odur.let.rug.nl/~vannoord/TextCat/
Bye
Laura

Combined with that, you could use an Italian stop-word list to run statistics on a page :-) ?!?

On Wednesday 24 April 2002 11:02, [EMAIL PROTECTED] wrote:
Hi all,
I'm using JoBo for spidering web sites and Lucene for indexing. The problem is that I'd like to spider only Italian web sites. How can I discover the country of a web site? Do you know of a method that you can suggest?
Thanks
Laura
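The stop-word idea mentioned in this message could look something like the sketch below: count what fraction of a page's words appear in an Italian stop-word list. The class name `StopWordStats`, the tiny sample word list, and any threshold you would apply to the ratio are assumptions for illustration; a real list would be much longer, and short words such as "in" also occur in other languages.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StopWordStats {
    // Tiny illustrative sample of common Italian function words (not a complete list).
    static final Set<String> ITALIAN_STOP_WORDS = new HashSet<>(Arrays.asList(
        "il", "lo", "la", "gli", "le", "di", "che", "\u00e8", "per", "un", "una",
        "non", "con", "sono", "del", "della", "e", "in"));

    /** Fraction of word tokens in the text that are in the stop-word set (0.0 for no tokens). */
    static double stopWordRatio(String text, Set<String> stopWords) {
        int total = 0;
        int hits = 0;
        // Split on anything that is not a Unicode letter.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total++;
            if (stopWords.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

A page with a clearly higher ratio against the Italian list than against lists for other languages is probably Italian; the cut-off would have to be tuned on real pages.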
Re: Italian web sites
Hm... this looks very interesting! If it is a Perl executable, you can just copy the text into a temp file, run the Perl executable on that file, and redirect the output to another temp file. Then read that file and use the result in a Lucene keyword field.

Regards,
karl øie

On Wednesday 24 April 2002 13:46, [EMAIL PROTECTED] wrote:
Hi all,
I have found a very interesting library which is written in Perl. The problem now is how I can use this library. Anyway, the library is TextCat, and you can find it at: http://odur.let.rug.nl/~vannoord/TextCat/
Bye
Laura
delete document
How do you delete a document from the index? I see in the FAQ to use IndexWriter.delete(Term); however, I don't see this in the current API JavaDocs, and this method isn't present in the lucene-1.2-rc4.jar that I downloaded from this site.

Tim Tschampel
Re: delete document
It's actually the IndexReader, not the IndexWriter... happy hacking!

On Wednesday 24 April 2002 15:27, Tim Tschampel wrote:
How do you delete a document from the index? I see in the FAQ to use IndexWriter.delete(Term); however, I don't see this in the current API JavaDocs, and this method isn't present in the lucene-1.2-rc4.jar that I downloaded from this site.

Tim Tschampel
RE: Lucene in action at www.mil.fi
The index is built on the local filesystem once every day.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: 23 April 2002 10:06
To: [EMAIL PROTECTED]
Subject: Re: Lucene in action at www.mil.fi

Hi Jari,
where do you build your index? On the filesystem? Do you use a database?
Laura

Hello,
I'm glad to inform you that I've built a complete Lucene-based web search solution for the Finnish Defence Forces web site and that it's online as of this moment. You can see it in action at: http://www2.mil.fi:8080/haku/haku?q=hornet
Cannot compile Lucene
I'm using Lucene rc4 and JavaCC 2.1. I'm trying to compile Lucene without Ant, by tossing the files into Project Builder (Mac OS X).

I ran JavaCC on StandardTokenizer.jj with the standard options, tossed the resulting files into the project, and now I'm running into a few errors:

1. StandardTokenizer.jj:173 is
org.apache.lucene.analysis.Token next() throws IOException
which is JavaCC'd into StandardTokenizer.java:26 as
final public org.apache.lucene.analysis.Token next() throws ParseException, IOException
which isn't a valid override. javac says
next() in org.apache.lucene.analysis.standard.StandardTokenizer cannot override next() in org.apache.lucene.analysis.TokenStream; overridden method does not throw org.apache.lucene.analysis.standard.ParseException

2. StandardTokenizer.java:26 says
token.beginColumn,token.endColumn
and there are no such member variables.

Am I totally missing something here?

Avi

--
Avi Drissman [EMAIL PROTECTED]
Argh! This darn mailserver is trunca
Re: Cannot compile Lucene
I've never used Project Builder (I use NetBeans on OS X), but you may want to try not including the .jj files.

--Peter

On 4/24/02 8:02 AM, Avi Drissman [EMAIL PROTECTED] wrote:

I'm using Lucene rc4 and JavaCC 2.1. I'm trying to compile Lucene without Ant, by tossing the files into Project Builder (Mac OS X).

I ran JavaCC on StandardTokenizer.jj with the standard options, tossed the resulting files into the project, and now I'm running into a few errors:

1. StandardTokenizer.jj:173 is
org.apache.lucene.analysis.Token next() throws IOException
which is JavaCC'd into StandardTokenizer.java:26 as
final public org.apache.lucene.analysis.Token next() throws ParseException, IOException
which isn't a valid override. javac says
next() in org.apache.lucene.analysis.standard.StandardTokenizer cannot override next() in org.apache.lucene.analysis.TokenStream; overridden method does not throw org.apache.lucene.analysis.standard.ParseException

2. StandardTokenizer.java:26 says
token.beginColumn,token.endColumn
and there are no such member variables.

Am I totally missing something here?

Avi
Re: Cannot compile Lucene
At 8:40 AM -0700 4/24/02, Peter Carlson wrote:
I've never used project builder (netbeans on OSX), but you may want to try not including the .jj files.

I don't include the .jj files. I compiled them with JavaCC 2.1 and included the resulting .java files in Project Builder.

I had to do something similar where I took the existing query parser .jj file, tweaked it, and JavaCC'd it. I had no problems compiling the resulting .java files there.

Avi

--
Avi Drissman [EMAIL PROTECTED]
Argh! This darn mailserver is trunca
Re: Italian web sites
Laura,

Hi all, I'm using JoBo for spidering web sites and Lucene for indexing. The problem is that I'd like to spider only Italian web sites. How can I discover the country of a web site? Do you know of a method that you can suggest?

The best method I know is to use character n-grams and the frequencies of the n-grams that occur most often:
http://citeseer.nj.nec.com/context/698873/68861

Regards,
Ype
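To make the character n-gram idea a little more concrete, here is a rough sketch. It is a simplified stand-in, not the method from the linked reference and not what TextCat does internally: it compares raw n-gram count vectors by cosine similarity and picks the closest reference profile. The class name `NGramLanguageGuesser` and the use of short sample strings as reference texts are assumptions; real profiles would be built from much larger samples of each language.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class NGramLanguageGuesser {
    /** Counts overlapping character n-grams of the lower-cased, whitespace-normalised text. */
    static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two n-gram count vectors (0.0 if either is empty). */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the label of the most similar reference profile (null only if there are none). */
    static String guess(String text, Map<String, Map<String, Integer>> references, int n) {
        Map<String, Integer> p = profile(text, n);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> ref : references.entrySet()) {
            double score = similarity(p, ref.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = ref.getKey();
            }
        }
        return best;
    }
}
```

In a spider, you would build one profile per language once, then call `guess` on the extracted text of each fetched page and index only the pages labelled Italian.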
Re: Cannot compile Lucene
Just curious, what exactly do people need to do to 'fix up the exceptions'? Which files need editing, and what needs to change to what? I'd just like to document that somewhere, that's why I'm asking...

Otis

--- Robert A. Decker [EMAIL PROTECTED] wrote:
I got it working under Project Builder. You just have to fix up the exceptions yourself. Also, you'll get some warnings (121 warnings to be exact) during the linking stage stating that an Integer Constant is too large - just ignore these - they're wrong.

thanks,
rob
http://www.robdecker.com/
http://www.planetside.com/
Re: Cannot compile Lucene
Unfortunately I didn't add the Lucene files to source code control until after I had gotten the project built and working, and therefore after my edits...

Here are some of the changes I do remember though:

Get the source to the Java JDK 1.2 StringBuffer and add it as org.apache.lucene.StringBuffer. This is because I'm stuck using the 1.1.8 version of the JDK, which is missing some StringBuffer methods used by Lucene.

Fix up some exceptions. For example, in org.apache.lucene.analysis.standard.StandardFilter, the next() method now throws java.io.IOException, org.apache.lucene.analysis.standard.ParseException. I believe before it just threw IOException... I believe the problems mostly just came up in the JavaCC-generated files.

I think this is another one. In org.apache.lucene.queryParser.QueryParser, the method final public Query Query(String field) now throws: org.apache.lucene.queryParser.ParseException, org.apache.lucene.analysis.standard.ParseException

thanks,
rob
http://www.robdecker.com/
http://www.planetside.com/

On Wed, 24 Apr 2002, Otis Gospodnetic wrote:
Just curious, what exactly people need to do to 'fix up the exceptions'? Editing of which files to change what to what? I'd just like to document that somewhere, that's why I'm asking...
Otis
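For readers unfamiliar with the javac error quoted in this thread: an overriding method in Java may not declare checked exceptions that the overridden method does not declare. The sketch below shows the rule and one possible workaround, which is to catch the extra checked exception and rethrow it as the exception type the base method already declares. This is not the edit described in the thread (which changed throws clauses), and every class name here (`DemoParseException`, `DemoTokenStream`, `DemoTokenizer`) is a hypothetical stand-in, not a Lucene class.

```java
import java.io.IOException;

/** Hypothetical stand-in for a generated parser's checked exception. */
class DemoParseException extends Exception {
    DemoParseException(String message) {
        super(message);
    }
}

/** Hypothetical stand-in for a base class whose next() declares only IOException. */
abstract class DemoTokenStream {
    abstract String next() throws IOException;
}

class DemoTokenizer extends DemoTokenStream {
    private final String[] tokens;
    private int pos = 0;

    DemoTokenizer(String... tokens) {
        this.tokens = tokens;
    }

    /** Plays the role of generated code that declares an extra checked exception. */
    private String generatedNext() throws DemoParseException {
        if (pos >= tokens.length) {
            return null; // end of stream
        }
        String t = tokens[pos++];
        if (t.isEmpty()) {
            throw new DemoParseException("empty token at index " + (pos - 1));
        }
        return t;
    }

    /** Valid override: declares no checked exception beyond the base method's IOException. */
    @Override
    String next() throws IOException {
        try {
            return generatedNext();
        } catch (DemoParseException e) {
            // Convert to the exception type the base class already allows.
            throw new IOException(e.getMessage());
        }
    }
}
```

Declaring `next() throws DemoParseException, IOException` in `DemoTokenizer` instead would reproduce the "cannot override" error, because `DemoTokenStream.next()` does not declare `DemoParseException`.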