AW: DC metadata

2009-09-23 Thread Koch Martina
Hi, I don't know the howto you're referring to but I think it belongs to an older version of Nutch. Let me try to explain... doc.add(key,value) - adds a new field to the document doc with the name key and the value value. With that knowledge the indexer just knows there is another field to

AW: splitting an index (yes, again)

2009-09-23 Thread Koch Martina
Hi Jesse, I'm not sure what you're trying to achieve. Do you want to use the distributed search or do you want to split an existing index? Neither of these tasks is a prerequisite for the other. If you want to split an index, there are several ways to do this. Which way to choose depends on the

Re: Event search engine

2009-09-23 Thread Michael Wechner
Mitia Notaras wrote: Hi there, The two event search engines I found are down: betherebesquare.com and BusyTonight.com. I would like your advice: Is it difficult to build one? I guess it depends on the details of the requirements. Do you have a requirements sheet? I have knowledge of web

Re: splitting an index (yes, again)

2009-09-23 Thread Alexander Aristov
Ok, I will paraphrase the question. Consider I want to use distributed search using 3 servers: one primary and two secondary nodes. I create a single BIG index with a distributed crawler running on other computers. Now I want to split this single BIG index into two parts to put on the search nodes. How

Re: HTML parsing and charset for Polish

2009-09-23 Thread Dawid Weiss
Polish Web sites use Cp1250 (windows-1250) or iso8859-2 (or UTF-8 of course). Check if diacritics like these: ęółąśćżń look all right in the above encodings and use appropriately. Dawid On Wed, Sep 16, 2009 at 4:47 PM, MilleBii mille...@gmail.com wrote: same thing when there is
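The check suggested above can be sketched in a few lines of Python. This is only an illustration, not anything Nutch itself does: the helper names `round_trips` and `differing_chars` are made up here, and the codec names are Python's spellings of the encodings mentioned in the message. It shows that all three encodings can represent the sample diacritics, and which of them have different byte values in Cp1250 and ISO-8859-2, which is where a wrongly guessed charset would show up.

```python
# Sample diacritics quoted in the message above.
SAMPLE = "ęółąśćżń"

# Python codec names for Cp1250, ISO-8859-2 and UTF-8.
ENCODINGS = ("cp1250", "iso8859_2", "utf_8")


def round_trips(text, encoding):
    """Return True if text survives encode + decode in the given encoding."""
    return text.encode(encoding).decode(encoding) == text


def differing_chars(text, enc_a="cp1250", enc_b="iso8859_2"):
    """List the characters whose byte representation differs between enc_a and enc_b."""
    return [c for c in text if c.encode(enc_a) != c.encode(enc_b)]


if __name__ == "__main__":
    for enc in ENCODINGS:
        # Print the raw bytes so they can be compared with a hex dump of a page.
        print(enc, SAMPLE.encode(enc).hex(" "), round_trips(SAMPLE, enc))
    # Characters that get corrupted if Cp1250 bytes are read as ISO-8859-2.
    print("differ:", differing_chars(SAMPLE))
```

If a page stored in one of these encodings is decoded with the other, only the characters reported by `differing_chars` come out wrong, so a search failing on exactly those letters would hint at a charset mix-up.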

Re: splitting an index (yes, again)

2009-09-23 Thread Jesse Hires
Exactly! sorry for being so confusing in my original question. Jesse int GetRandomNumber() { return 4; // Chosen by fair roll of dice // Guaranteed to be random } // xkcd.com On Wed, Sep 23, 2009 at 4:45 AM, Alexander Aristov alexander.aris...@gmail.com wrote: Ok, I

Re: HTML parsing and charset for Polish

2009-09-23 Thread MilleBii
At last someone answers. Correct CP1250. My pages look fine in the browsers of course, but it does not mean Nutch handles them properly. What I'm wondering is if the Nutch HTML parser reads them properly, because when I do a search on such characters it fails on pages in iso8859-2 or cp1250, but

Specify at least one source--a file or resource collection error

2009-09-23 Thread Jaime Martín
Hi: I'm following the steps to run Nutch 1.0 release with Eclipse and Windows described in this link http://wiki.apache.org/nutch/RunNutchInEclipse1.0 I'm trying to build it, but when I launch the war target I have this error C:\ECLIPSE321\workspace\nutch-1.0\build.xml:62: Specify at least one

RE: AW: DC metadata

2009-09-23 Thread BELLINI ADAM
hi, thank you for your answer... i was talking about this howto : CreateNewFilter Howto add a category metadata to your index and be able to search for it. For this, you need to write an indexing filter and a query filter. Indexing your custom metadata For the indexing filter, copy the

AW: DC metadata

2009-09-23 Thread Koch Martina
Hi, the howtos you're referring to are for Nutch 0.9. In Nutch 1.0 the indexing system changed a little bit. If you look at the index-basic or index-more plugin you see that the doc.add method changed. It's no longer doc.add(new Field(category, puppies, false, true, false)) - here you create

RE: AW: DC metadata

2009-09-23 Thread BELLINI ADAM
Yes, I saw the differences and I wrote my index-cutom plugin like the index-more plugin (nutch-1.0). But I guess you're right! I didn't use the addFieldOptions method to add my custom fields' information ... so I will add them in this method.. As for the parser, I first have to see how the htmlparser is made

Re: AW: Null Indexing

2009-09-23 Thread Cisek
I had the same little big problem - everything seemed OK: - bin/nutch org.apache.nutch.searcher.NutchBean search query ... [in my case search query = apache] in cygwin returns 62 Total hits on crawled +^http://([a-z0-9]*\.)*apache.org/ - Nutch in Tomcat webapp after deploy seemed fine (no

Re: HTML parsing and charset for Polish

2009-09-23 Thread Dawid Weiss
Can you provide the HTTP headers and HEAD of the HTML of a Web page for which Nutch fails? Perhaps there is an inconsistency between HTTP and META headers or a misspelled codepage? Just a wild guess, but believe me -- Java does convert fine between Cp1250, Iso8859-2 and internal UTF-16 so there
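The two failure modes guessed at above (HTTP and META disagreeing, or a misspelled codepage) can be checked by hand with a short script. The sketch below is an assumption-laden illustration in Python, not Nutch's own charset detection: the function names and the returned status strings are invented for this example, and the regular expressions only cover the common `charset=...` forms.

```python
import codecs
import re

# Matches charset=... in a Content-Type value or inside a <meta> tag.
_HTTP_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.I)
_META_RE = re.compile(r"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.I)


def charset_from_http(content_type):
    """Extract the charset label from an HTTP Content-Type header value."""
    m = _HTTP_RE.search(content_type or "")
    return m.group(1) if m else None


def charset_from_meta(html_head):
    """Extract the charset label from the first matching META tag."""
    m = _META_RE.search(html_head or "")
    return m.group(1) if m else None


def canonical(label):
    """Map aliases (e.g. windows-1250) to one codec name; None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


def diagnose(content_type, html_head):
    """Return (status, details) comparing the HTTP and META charset labels."""
    labels = {"http": charset_from_http(content_type),
              "meta": charset_from_meta(html_head)}
    for source, label in labels.items():
        if label is not None and canonical(label) is None:
            return ("unknown", f"{source}: {label}")  # possibly misspelled
    names = {canonical(v) for v in labels.values() if v is not None}
    if len(names) > 1:
        return ("mismatch", f"http={labels['http']} meta={labels['meta']}")
    return ("consistent", names.pop() if names else None)
```

For example, `diagnose("text/html; charset=iso-8859-2", '<meta charset="windows-1250">')` reports a mismatch, while a label that the codec registry does not recognise is reported as unknown.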