RE: Italian web sites

2002-04-24 Thread Nader S. Henein

Sniff the site's IP address and look it up in the database at the
internet topology website http://netgeo.caida.org/perl/netgeo.cgi
to find the country of origin. (Use the answers to populate your
own DB, so that remote lookups decrease as you accumulate IPs.) That
will give you web sites hosted in Italy, though, not Italian-language
web sites. Unfortunately, unless Italian pages use a different
encoding, picking the language up from the page itself (JavaScript)
won't help much.
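
A minimal sketch of the "populate your own DB" idea, assuming an in-memory
map stands in for the database. The CountryLookup interface is a
hypothetical hook for the remote query; the NetGeo request format is not
given in this thread, so it is deliberately left out.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

/** Caches IP-to-country answers so each address is looked up remotely only once. */
final class CountryCache {

    /** Hypothetical hook for the remote lookup (for example a NetGeo query). */
    interface CountryLookup {
        String countryOf(String ip);
    }

    private final Map<String, String> byIp = new HashMap<String, String>();
    private final CountryLookup remote;
    private int remoteCalls = 0;

    CountryCache(CountryLookup remote) {
        this.remote = remote;
    }

    /** Returns the cached country for an IP, asking the remote service on a miss. */
    String countryOfIp(String ip) {
        String country = byIp.get(ip);
        if (country == null) {
            country = remote.countryOf(ip);
            remoteCalls++;
            if (country != null) {
                byIp.put(ip, country); // remember the answer for next time
            }
        }
        return country;
    }

    /** Resolves a host name to its IP address, then looks up the country. */
    String countryOfHost(String host) throws UnknownHostException {
        return countryOfIp(InetAddress.getByName(host).getHostAddress());
    }

    /** Number of lookups that had to go to the remote service. */
    int remoteCalls() {
        return remoteCalls;
    }
}
```

A crawler could call countryOfHost() for each new host and skip hosts whose
country is not the one wanted. As noted above, this finds sites hosted in
a country, not sites written in a language.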




-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 24, 2002 1:03 PM
To: [EMAIL PROTECTED]
Subject: Italian web sites


Hi all,

I'm using JoBo for spidering web sites and Lucene for indexing. The
problem is that I'd like to spider only Italian web sites.
How can I discover the country of a web site?

Do you know of a method that you can suggest?

Thanks


Laura



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Re: HTML parser

2002-04-24 Thread Kelvin Tan

Otis, what's the final conclusion you've arrived at regarding the HTML
filter/parsing?

I have pretty much the same requirements as you do right now (extract text,
and obtain the title).

Kelvin

----- Original Message -----
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, April 22, 2002 12:27 AM
Subject: Re: HTML parser


 Laura,

 http://marc.theaimsgroup.com/?l=lucene-user&w=2&r=1&s=Spindle&q=b

 Oops, it's JoBo, not MoJo :)
 http://www.matuschek.net/software/jobo/

 Otis

 --- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
  Hi Otis,
 
  thanks for your reply. I have been looking for Spindle and Mojo for 2
  hours but I haven't found anything.
 
  Can you help me? Where can I find something?
 
  Thanks for your help and time
 
  Laura
 
 
   Laura,
  
   Search the lucene-user and lucene-dev archives for things like:
   crawler
   spider
   spindle
   lucene sandbox
  
   Spindle is something you may want to look at, as is MoJo (not
   mentioned on lucene lists, use Google).
  
   Otis
  
    Has anyone solved the problem of recursively spidering web pages?
  
     While trying to research the same thing, I found the following.
     Here's a good example of link extraction.

     Try http://www.quiotix.com/opensource/html-parser

     It's easy to write a Visitor which extracts the links; it should
     take about ten lines of code.
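
For illustration only, here is a short link and title extractor. It does
not use the quiotix parser or its Visitor interface (that API is not
reproduced in this thread); it swaps in plain java.util.regex, which is
good enough for a sketch but is not a real HTML parser. The class and
method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex-based stand-in for a parser visitor; hypothetical helper, not a library class. */
final class LinkExtractor {

    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches the text between <title> and </title>.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the quoted href values of all anchor tags, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    /** Returns the trimmed page title, or null if there is no title element. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

A recursive spider would feed each extracted link back into its fetch
queue after resolving it against the page URL and checking that it has not
been visited already. Unquoted href values are not handled by this sketch.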
  
  




Re: Italian web sites

2002-04-24 Thread [EMAIL PROTECTED]

Hi all,

I have found a very interesting library, which is written in Perl.
The problem now is how I can use this library.

Anyway, the library is TextCat and you can find it at:

http://odur.let.rug.nl/~vannoord/TextCat/

Bye

Laura

 combined with that you could use an Italian stop-word list to run
 statistics on a page :-) ?!?
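
One way the stop-word statistics could look, as a rough sketch: count what
fraction of a page's words appear in an Italian stop-word list. The word
list below is a tiny illustrative sample and the class name is made up;
any threshold for calling a page Italian would need tuning on real pages.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper that scores text against a stop-word list. */
final class StopWordScore {

    // A small sample of common Italian function words; not a complete list.
    static final Set<String> ITALIAN = new HashSet<String>(Arrays.asList(
            "il", "lo", "la", "gli", "le", "di", "che", "un", "una", "per",
            "non", "con", "del", "della", "sono", "nel", "alla", "dei",
            "anche", "ma"));

    /** Fraction of word tokens in the text that appear in the stop-word set. */
    static double ratio(String text, Set<String> stopWords) {
        // Split on anything that is not a letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        int words = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            words++;
            if (stopWords.contains(token)) {
                hits++;
            }
        }
        return words == 0 ? 0.0 : (double) hits / words;
    }
}
```

Several of these words also occur in French or Spanish, so on its own this
score is a weak signal; it is meant to be combined with another method, as
the suggestion above says.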
 
 
 


Re: Italian web sites

2002-04-24 Thread Karl Øie

Hm... this looks very interesting! If it is a Perl exe you can just copy
the text into a temp file, run the Perl exe on that file, and redirect
the output to another temp file. Then read that file and use the result
in a Lucene keyword.
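
The temp-file round trip described above can be sketched as follows. The
exact TextCat command line is not given in this thread, so the command is
passed in by the caller; the class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that runs an external program on a temp file. */
final class ExternalClassifier {

    /**
     * Writes the text to a temp file, runs the command with that file's path
     * appended as the last argument, redirects the program's output to a
     * second temp file, and returns that output with whitespace trimmed.
     */
    static String run(List<String> command, String text) {
        File in = null;
        File out = null;
        try {
            in = File.createTempFile("page", ".txt");
            out = File.createTempFile("result", ".txt");
            Files.write(in.toPath(), text.getBytes(StandardCharsets.UTF_8));

            List<String> cmd = new ArrayList<String>(command);
            cmd.add(in.getAbsolutePath());
            Process process = new ProcessBuilder(cmd)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .redirectOutput(out)       // stdout goes to the second temp file
                    .start();
            int exit = process.waitFor();
            if (exit != 0) {
                throw new IOException("command exited with status " + exit);
            }
            byte[] bytes = Files.readAllBytes(out.toPath());
            return new String(bytes, StandardCharsets.UTF_8).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the command", e);
        } finally {
            if (in != null) {
                in.delete();
            }
            if (out != null) {
                out.delete();
            }
        }
    }
}
```

The returned string (for example a language name printed by the script)
could then be stored in a keyword field of the Lucene document. Starting
one process per page is slow, so this is only practical for modest crawls.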

Regards, Karl Øie





delete document

2002-04-24 Thread Tim Tschampel

How do you delete a document from the index?
The FAQ says to use IndexWriter.delete(Term), but I don't see this
method in the current API JavaDocs, and it is not present in the
lucene-1.2-rc4.jar that I downloaded from this site.


Tim Tschampel







Re: delete document

2002-04-24 Thread Karl Øie

It's actually the IndexReader, not the IndexWriter, that has delete(Term)...

happy hacking!








RE: Lucene in action at www.mil.fi

2002-04-24 Thread Jari Aarniala

The index is built on the local filesystem once every day.

 -----Original Message-----
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
 Sent: 23 April 2002 10:06
 To: [EMAIL PROTECTED]
 Subject: Re: Lucene in action at www.mil.fi
 
 Hi Jari,
 
 Where do you build your index? On the filesystem? Do you use a database?
 
 Laura
 
 
  Hello,
 
  I'm glad to inform you that I've built a complete Lucene-based web
  search solution for the Finnish Defence Forces web site and that it's
  online as of this moment.
 
  You can see it in action at:
  http://www2.mil.fi:8080/haku/haku?q=hornet







Cannot compile Lucene

2002-04-24 Thread Avi Drissman

I'm using Lucene rc4 and JavaCC 2.1. I'm trying to compile Lucene 
without Ant, by tossing the files into Project Builder (Mac OS X). I 
ran JavaCC on StandardTokenizer.jj with the standard options, tossed 
the resulting files into the project, and now I'm running into a few 
errors:

1. StandardTokenizer.jj:173 is

org.apache.lucene.analysis.Token next() throws IOException

which is JavaCC'd into StandardTokenizer.java:26 as

final public org.apache.lucene.analysis.Token next() throws 
ParseException, IOException

which isn't a valid override. javac says

next() in org.apache.lucene.analysis.standard.StandardTokenizer 
cannot override next() in org.apache.lucene.analysis.TokenStream; 
overridden method does not throw 
org.apache.lucene.analysis.standard.ParseException

2. StandardTokenizer.java:26 says

token.beginColumn,token.endColumn

and there are no such member variables.

Am I totally missing something here?

Avi

-- 
Avi Drissman
[EMAIL PROTECTED]
Argh! This darn mailserver is trunca





Re: Cannot compile Lucene

2002-04-24 Thread Peter Carlson

I've never used Project Builder (I use NetBeans on OS X), but you may
want to try not including the .jj files.

--Peter





Re: Cannot compile Lucene

2002-04-24 Thread Avi Drissman

At 8:40 AM -0700 4/24/02, Peter Carlson wrote:

I've never used project builder (netbeans on OSX), but you may want to try
not including the .jj files.

I don't include the .jj files. I compiled them with JavaCC 2.1 and 
included the resulting .java files in Project Builder.

I had to do something similar where I took the existing query parser 
.jj file, tweaked it, and JavaCC'd it. I had no problems compiling 
the resulting .java files there.

Avi





Re: Italian web sites

2002-04-24 Thread Ype Kingma

Laura

The best method I know for recognizing the language of a page is to use
character n-grams and compare the frequencies of the most frequent n-grams:
http://citeseer.nj.nec.com/context/698873/68861
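
As a rough illustration of the n-gram idea, the sketch below builds a
ranked profile of the most frequent character n-grams (lengths 1 to 3)
and compares two profiles by how far each n-gram's rank differs. This
rank-difference measure is one common variant; the reference linked above
may differ in its details, and all names here are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for character n-gram language profiles. */
final class NGramProfile {

    /** Returns the `size` most frequent character n-grams (n = 1..3), most frequent first. */
    static List<String> profile(String text, int size) {
        // Lower-case, collapse non-letters to single spaces, pad with word boundaries.
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim() + " ";
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                String gram = s.substring(i, i + n);
                if (gram.trim().isEmpty()) {
                    continue; // skip n-grams that consist only of spaces
                }
                Integer c = counts.get(gram);
                counts.put(gram, c == null ? 1 : c + 1);
            }
        }
        List<String> grams = new ArrayList<String>(counts.keySet());
        Collections.sort(grams, new Comparator<String>() {
            public int compare(String a, String b) {
                int byCount = counts.get(b) - counts.get(a);
                return byCount != 0 ? byCount : a.compareTo(b); // ties broken alphabetically
            }
        });
        return grams.subList(0, Math.min(size, grams.size()));
    }

    /** Sum of rank differences; n-grams missing from the reference cost the maximum. */
    static int distance(List<String> document, List<String> reference) {
        int total = 0;
        for (int i = 0; i < document.size(); i++) {
            int j = reference.indexOf(document.get(i));
            total += (j < 0) ? reference.size() : Math.abs(i - j);
        }
        return total;
    }
}
```

To classify a page, build one reference profile per language from a few
kilobytes of known text, build a profile of the page, and pick the
language whose reference profile has the smallest distance. Profiles of a
few hundred n-grams are a typical starting point, but that size is an
assumption to tune, not a figure from this thread.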

Regards,
Ype





Re: Cannot compile Lucene

2002-04-24 Thread Otis Gospodnetic

Just curious: what exactly do people need to do to 'fix up the
exceptions'? Which files need editing, and what has to change to what?

I'd just like to document that somewhere, that's why I'm asking...

Otis

--- Robert A. Decker [EMAIL PROTECTED] wrote:
 I got it working under Project Builder. You just have to fix up the
 exceptions yourself. Also, you'll get some warnings (121 warnings, to
 be exact) during the linking stage stating that an Integer Constant is
 too large - just ignore these - they're wrong.
 
 thanks,
 rob
 
 http://www.robdecker.com/
 http://www.planetside.com/
 



Re: Cannot compile Lucene

2002-04-24 Thread Robert A. Decker

Unfortunately I didn't add the lucene files to source code control until
after I had gotten the project built and working, and therefore after my
edits...

Here are some of the changes I do remember, though:

Get the source of the JDK 1.2 StringBuffer and add it as
org.apache.lucene.StringBuffer. This is because I'm stuck using version
1.1.8 of the JDK, which is missing some StringBuffer methods used by
Lucene.

Fix up some exceptions. For example, in
org.apache.lucene.analysis.standard.StandardFilter, the next() method now
throws java.io.IOException,
org.apache.lucene.analysis.standard.ParseException

I believe that before it just threw IOException...

I believe the problems mostly came up in the JavaCC-generated files.

I think this is another one. In org.apache.lucene.queryParser.QueryParser,
the method final public Query Query(String field) now throws:
org.apache.lucene.queryParser.ParseException,
org.apache.lucene.analysis.standard.ParseException
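
The compile error behind these edits comes from a Java rule: an overriding
method may not declare checked exceptions that the overridden method does
not allow. The sketch below uses made-up stand-in classes, not the Lucene
sources, to show one way to satisfy the rule without widening the base
class: make the parse exception a subclass of IOException. Whether that
suits a given build is an assumption to check against the actual sources.

```java
import java.io.IOException;

/** Stand-in for a base class whose method declares only IOException. */
abstract class BaseStream {
    abstract String next() throws IOException;
}

/** Stand-in for a generated parse error; extending IOException is this sketch's assumption. */
class SketchParseException extends IOException {
    SketchParseException(String message) {
        super(message);
    }
}

/** Compiles because everything next() declares is an IOException. */
final class SketchTokenizer extends BaseStream {
    private final String[] tokens;
    private int pos = 0;

    SketchTokenizer(String input) {
        this.tokens = input.split(" ");
    }

    @Override
    String next() throws SketchParseException, IOException {
        if (pos >= tokens.length) {
            return null; // end of stream
        }
        String token = tokens[pos++];
        if (token.isEmpty()) {
            throw new SketchParseException("empty token before position " + pos);
        }
        return token;
    }
}
```

If SketchParseException extended Exception instead, javac would reject the
override with the same kind of "overridden method does not throw" message
quoted in this thread. The other options are to add the exception to the
base method's throws clause, as described above, or to catch it inside
next() and rethrow it wrapped in an IOException.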

thanks,
rob

http://www.robdecker.com/
http://www.planetside.com/
