Re: nutch 1.4, solr 3.4 configuration error

2013-06-07 Thread Tuğcem Oral
I had a similar error. I couldn't find any documentation on which Nutch and
Solr versions are compatible. For instance, we're using Nutch 1.6 on
Hadoop 1.0.4 with SolrJ 3.4.0 and index crawled segments to Solr 4.2.0. But
I remember that I could find a compatible version of SolrJ for Nutch 1.4
(because of using Hadoop). You can upgrade your Nutch from 1.4 to 1.6
easily. I also suggest you check your solrindex-mapping.xml in your
/conf directory.
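
For what it's worth, here is a rough way to verify that mapping (a sketch
only; $NUTCH_HOME and $SOLR_HOME stand for your actual install paths):
every dest field in solrindex-mapping.xml should exist in Solr's schema.xml.

  grep -o 'dest="[^"]*"' $NUTCH_HOME/conf/solrindex-mapping.xml \
    | sed 's/dest=//; s/"//g' \
    | while read f; do
        grep -q "name=\"$f\"" $SOLR_HOME/conf/schema.xml \
          || echo "field $f has no match in schema.xml"
      done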

Best,

Tugcem.


On Fri, Jun 7, 2013 at 12:58 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : ./nutch crawl urls -dir myCrawl2 -solr http://localhost:8080 -depth 2
 -topN
 ...
 : Caused by: org.apache.solr.common.SolrException: Not Found
 :
 : Not Found
 :
 : request: http://localhost:8080/select?q=id:[* TO *]&fl=id&rows=1&wt=javabin&version=2
 ...
 : Other possibly helpful information:
 : 1) The solr admin screen comes up fine in the browser.

 At which URL does the Solr admin screen come up fine in your browser?

 Best guess...

 1) you have solr installed such that it uses the webcontext /solr but
 you gave the wrong url to nutch (ie: try -solr
 http://localhost:8080/solr)

 2) you are using multiple collections, and you may need to configure nutch
 to know about which collection you are using (ie: try -solr
 http://localhost:8080/solr/collection1)

 ...if neither of those helps, i would suggest you follow up with the
 nutch-user list, as the nutch community is probably in the best position
 to help you configure nutch to work with Solr and vice versa.


 -Hoss




-- 
TO


Re: nutch 1.4, solr 3.4 configuration error

2013-06-06 Thread bbarani
Can you check if you have the correct SolrJ client library version in both
Nutch and the Solr server?
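
A rough way to compare them, assuming a Tomcat-deployed Solr (both paths
below are examples; adapt them to your layout):

  ls $NUTCH_HOME/lib | grep -i solrj
  ls $CATALINA_HOME/webapps/solr/WEB-INF/lib | grep -i solrj

The two jars should carry the same major version.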





Re: nutch 1.4, solr 3.4 configuration error

2013-06-06 Thread Chris Hostetter
: ./nutch crawl urls -dir myCrawl2 -solr http://localhost:8080 -depth 2 -topN
...
: Caused by: org.apache.solr.common.SolrException: Not Found
: 
: Not Found
: 
: request: http://localhost:8080/select?q=id:[* TO *]&fl=id&rows=1&wt=javabin&version=2
...
: Other possibly helpful information:
: 1) The solr admin screen comes up fine in the browser.

At which URL does the Solr admin screen come up fine in your browser?

Best guess...

1) you have solr installed such that it uses the webcontext /solr but
you gave the wrong url to nutch (ie: try -solr
http://localhost:8080/solr)

2) you are using multiple collections, and you may need to configure nutch
to know about which collection you are using (ie: try -solr
http://localhost:8080/solr/collection1)

...if neither of those helps, i would suggest you follow up with the
nutch-user list, as the nutch community is probably in the best position
to help you configure nutch to work with Solr and vice versa.
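
For example, a retry against the /solr webcontext might look like this (the
-topN value below is only a placeholder, since the original command was
truncated above):

  ./nutch crawl urls -dir myCrawl2 -solr http://localhost:8080/solr -depth 2 -topN 50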


-Hoss


Re: nutch and solr

2012-02-27 Thread alessio crisantemi
now all works!

I have another problem if I use a connector with my Solr-Nutch.
This is the error:

Grave: java.lang.RuntimeException:
org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
 at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
 at org.apache.solr.core.SolrCore.init(SolrCore.java:579)
 at org.apache.solr.core.CoreContainer.create(CoreContainer.java:428)
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:278)
 at
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
 at connector.SolrConnector.init(SolrConnector.java:33)
 at connector.SolrConnector.getInstance(SolrConnector.java:69)
 at connector.SolrConnector.getSolrServer(SolrConnector.java:77)
 at connector.QueryServlet.doGet(QueryServlet.java:117)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
 at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)
 at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
 at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
 at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
 at
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
 at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
 at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
 at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
 at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
 at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
 at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
 at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
 at
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:309)
 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.index.CorruptIndexException: Unknown format
version: -11
 at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:247)
 at
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:72)
 at
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
 at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:403)
 at
org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
 at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1057)
 ... 26 more

SUGGESTIONS?
thanks,
alessio
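
For reference, an "Unknown format version" from SegmentInfos.read usually
means the index on disk was written by a different Lucene version than the
one the reading code bundles. One blunt way out, as a sketch (destructive:
it throws the index away, and it assumes the stock example layout):

  cd apache-solr-1.3.0/example/solr
  rm -rf data/index
  # restart Solr so it recreates an empty index with its own Lucene
  # version, then re-run the nutch indexing step to repopulate it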

On 25 February 2012 at 10:52, alessio crisantemi 
alessio.crisant...@gmail.com wrote:

 this is the problem!
 Because in my root there is a url!

 I write you my step-by-step configuration of nutch:
 (I use cygwin because I work on windows)

 *1. Extract the Nutch package*

 *2. Configure Solr*
 (*Copy the provided Nutch schema from directory apache-nutch-1.0/conf to
 directory apache-solr-1.3.0/example/solr/conf (override the existing
 file).* We want to allow Solr to create the snippets for search results, so
 we need to store the content in addition to indexing it:

 *b. Change schema.xml so that the stored attribute of field "content" is
 true.*

 <field name="content" type="text" stored="true" indexed="true"/>

 We want to be able to tweak the relevancy of queries easily, so we'll
 create a new dismax request handler configuration for our use case:

 *d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the
 following fragment into it*

 <requestHandler name="/nutch" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="defType">dismax</str>
     <str name="echoParams">explicit</str>
     <float name="tie">0.01</float>
     <str name="qf">
       content^0.5 anchor^1.0 title^1.2
     </str>
     <str name="pf">
       content^0.5 anchor^1.5 title^1.2 site^1.5
     </str>
     <str name="fl">
       url
     </str>
     <str name="mm">
       2&lt;-1 5&lt;-2 6&lt;90%
     </str>
     <int name="ps">100</int>
     <bool name="hl">true</bool>
     <str name="q.alt">*:*</str>
     <str name="hl.fl">title url content</str>
     <str name="f.title.hl.fragsize">0</str>
     <str name="f.title.hl.alternateField">title</str>
     <str name="f.url.hl.fragsize">0</str>
     <str name="f.url.hl.alternateField">url</str>
     <str name="f.content.hl.fragmenter">regex</str>
   </lst>
 </requestHandler>

 *3. Start Solr*

 cd apache-solr-1.3.0/example

 java -jar start.jar

 *4. Configure Nutch*

 *a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace its
 contents with the following (we specify our crawler name, active plugins
 and limit the maximum url count for a single host per run to 100):*

 <?xml version="1.0"?>

Re: nutch and solr

2012-02-25 Thread alessio crisantemi
this is the problem!
Because in my root there is a url!

I write you my step-by-step configuration of nutch:
(I use cygwin because I work on windows)

*1. Extract the Nutch package*

*2. Configure Solr*
(*Copy the provided Nutch schema from directory apache-nutch-1.0/conf to
directory apache-solr-1.3.0/example/solr/conf (override the existing file).*
We want to allow Solr to create the snippets for search results, so we need
to store the content in addition to indexing it:

*b. Change schema.xml so that the stored attribute of field "content" is
true.*

<field name="content" type="text" stored="true" indexed="true"/>

We want to be able to tweak the relevancy of queries easily, so we'll create
a new dismax request handler configuration for our use case:

*d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the
following fragment into it*

<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf">
      content^0.5 anchor^1.5 title^1.2 site^1.5
    </str>
    <str name="fl">
      url
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

*3. Start Solr*

cd apache-solr-1.3.0/example

java -jar start.jar
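
Once Solr is up and something has been indexed, the /nutch handler above can
be exercised directly; for example (the query term is arbitrary):

  curl "http://localhost:8983/solr/nutch?q=apache"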

*4. Configure Nutch*

*a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace its
contents with the following (we specify our crawler name, active plugins
and limit the maximum url count for a single host per run to 100):*

<?xml version="1.0"?>

<configuration>

  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>

  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

</configuration>

*b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, replace
its content with the following:*

-^(https|telnet|file|ftp|mailto):



# skip some suffixes

-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$



# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]



# accept urls in the google.it domain

+^http://([a-z0-9\-A-Z]*\.)*google.it/*



# deny anything *else*

-.
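
As a sanity check, Nutch can run a URL through the configured filters from
the command line (a sketch; run it from the nutch runtime directory; a
leading '+' in the output means the url was accepted):

  echo "http://www.google.it/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined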

*5. Create a seed list (the initial urls to fetch)*

mkdir urls *(creates a folder 'urls')*

echo "http://www.google.it/" >> urls/seed.txt

*6. Inject seed url(s) to nutch crawldb (execute in nutch directory)*

bin/nutch inject crawl/crawldb urls
AND HERE, THE ERROR MESSAGE about an empty path. Why, in your opinion?
thank you
alessio

On 24 February 2012 at 17:51, tamanjit.bin...@yahoo.co.in 
tamanjit.bin...@yahoo.co.in wrote:

 The empty path message is because nutch is unable to find a url in the url
 location that you provide.

 Kindly ensure there is a url there.




Re: nutch and solr

2012-02-24 Thread tamanjit.bin...@yahoo.co.in
The empty path message is because nutch is unable to find a url in the url
location that you provide.

Kindly ensure there is a url there.
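
A quick sanity check, as a sketch (the directory must be the one you pass to
the crawl command, and the url must be absolute):

  cat urls/*.txt    # should print at least one line like http://www.example.com/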



Re: nutch and solr

2012-02-22 Thread alessio crisantemi
thanks for your reply, but it doesn't work.
the same message: can't convert empty path

and more: impossible to find class org.apache.nutch.crawl.injector

..


On 22 February 2012 at 06:14, tamanjit.bin...@yahoo.co.in 
tamanjit.bin...@yahoo.co.in wrote:

 Try this command.

  bin/nutch crawl urls/<folder name>/<url file>.txt -dir crawl/<folder name>
 -threads 10 -depth 2 -topN 1000

 Your folder structure will look like this:

 nutch folder --> urls --> <folder name> --> <url file>.txt
             |
             |
             --> crawl --> <folder name>

 The folder name will be for different domains. So for each domain folder in
 urls folder there has to be a corresponding folder (with the same name) in
 the crawl folder.





Re: nutch and solr

2012-02-21 Thread tamanjit.bin...@yahoo.co.in
Try this command.

 bin/nutch crawl urls/<folder name>/<url file>.txt -dir crawl/<folder name>
-threads 10 -depth 2 -topN 1000

Your folder structure will look like this:

nutch folder --> urls --> <folder name> --> <url file>.txt
            |
            |
            --> crawl --> <folder name>

The folder name will be for different domains. So for each domain folder in
urls folder there has to be a corresponding folder (with the same name) in
the crawl folder.
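
Concretely, a sketch of that layout for a hypothetical domain folder named
mysite (all names here are examples):

  mkdir -p urls/mysite crawl/mysite
  echo "http://www.mysite.com/" > urls/mysite/seed.txt
  bin/nutch crawl urls/mysite/seed.txt -dir crawl/mysite -threads 10 -depth 2 -topN 1000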




Re: nutch in solr

2012-02-05 Thread Matthew Parker
Doesn't Tomcat run on port 8080, and not port 8983? Or did you change
Tomcat's default port to 8983?
On Feb 5, 2012 5:17 AM, alessio crisantemi alessio.crisant...@gmail.com
wrote:

 Hi All,
 I have some problems with integration of Nutch in Solr and Tomcat.

 I followed the Nutch tutorial for integration and now I can crawl a website: all
 works right.
 But if I try the solr integration, I can't index on Solr.

 Below is the nutch output after the command:
 bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5

 I read java.lang.RuntimeException: Invalid version (expected 2, but 1) or
 the data in not in 'javabin' format
 MAYBE THERE IS A PROBLEM BETWEEN THE NUTCH 1.4 VERSION AND SOLR 1.4.1? MAYBE
 IT REQUIRES A 3.X SOLR VERSION?

 thanks,
 a.

 crawl started in: crawl-20120203151719
 rootUrlDir = urls
 threads = 10
 depth = 3
 solrUrl=http://127.0.0.1:8983/solr/
 topN = 5
 Injector: starting at 2012-02-03 15:17:20
 Injector: crawlDb: crawl-20120203151719/crawldb
 Injector: urlDir: urls
 Injector: Converting injected urls to crawl db entries.
 Injector: Merging injected urls into crawl db.
 Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
 Generator: starting at 2012-02-03 15:17:31
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: topN: 5
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls for politeness.
 Generator: segment: crawl-20120203151719/segments/20120203151735
 Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
 Fetcher: Your 'http.agent.name' value should be listed first in
 'http.robots.agents' property.
 Fetcher: starting at 2012-02-03 15:17:39
 Fetcher: segment: crawl-20120203151719/segments/20120203151735
 Using queue mode : byHost
 Fetcher: threads: 10
 Fetcher: time-out divisor: 2
 QueueFeeder finished: total 1 records + hit by time limit :0
 Using queue mode : byHost
 Using queue mode : byHost
 Using queue mode : byHost
 fetching http://www.gioconews.it/
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=3
 -finishing thread FetcherThread, activeThreads=2
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 Using queue mode : byHost
 Using queue mode : byHost
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Fetcher: throughput threshold: -1
 -finishing thread FetcherThread, activeThreads=1
 Fetcher: throughput threshold retries: 5
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=1
 fetch of http://www.gioconews.it/ failed with:
 java.net.UnknownHostException: www.gioconews.it
 -finishing thread FetcherThread, activeThreads=0
 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=0
 Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
 ParseSegment: starting at 2012-02-03 15:17:44
 ParseSegment: segment: crawl-20120203151719/segments/20120203151735
 ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
 CrawlDb update: starting at 2012-02-03 15:17:48
 CrawlDb update: db: crawl-20120203151719/crawldb
 CrawlDb update: segments: [crawl-20120203151719/segments/20120203151735]
 CrawlDb update: additions allowed: true
 CrawlDb update: URL normalizing: true
 CrawlDb update: URL filtering: true
 CrawlDb update: 404 purging: false
 CrawlDb update: Merging segment data into db.
 CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
 Generator: starting at 2012-02-03 15:17:53
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: topN: 5
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: 0 records selected for fetching, exiting ...
 Stopping at depth=1 - no more URLs to fetch.
 LinkDb: starting at 2012-02-03 15:17:57
 LinkDb: linkdb: crawl-20120203151719/linkdb
 LinkDb: URL normalize: true
 LinkDb: URL filter: true
 LinkDb: adding segment:

 file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203151719/segments/20120203151735
 LinkDb: finished at 2012-02-03 15:18:01, elapsed: 00:00:04
 SolrIndexer: starting at 2012-02-03 15:18:01
 java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data
 in not in 'javabin' format
 SolrDeleteDuplicates: starting at 2012-02-03 15:18:09
 SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
 Exception in thread main java.io.IOException:
 org.apache.solr.client.solrj.SolrServerException: Error executing query
at

 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
at
 org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at
 

Re: nutch in solr

2012-02-05 Thread tamanjit.bin...@yahoo.co.in
alessio crisantemi-2,
I think you got it. Check the jars in the nutch lib and see if the solr and
solrj jars are the same... That could be the issue



Re: nutch in solr

2012-02-05 Thread alessio crisantemi
no, all run on port 8983.
..

2012/2/5 Matthew Parker mpar...@apogeeintegration.com

 Doesn't tomcat run on port 8080, and not port 8983? Or did you change the
 tomcat's default port to 8983?
 On Feb 5, 2012 5:17 AM, alessio crisantemi alessio.crisant...@gmail.com
 
 wrote:

  Hi All,
  I have some problems with integration of Nutch in Solr and Tomcat.
 
  I follo Nutch tutorial for integration and now, I can crawl a website:
 all
  works right.
  But It I try the solr integration, I can't indexing on Solr.
 
  follow the nutch output after the command:
  bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5
 
  I read java.lang.RuntimeException: Invalid version (expected 2, but 1)
 or
  the data in not in 'javabin' format
  MAY BE THERE IS A PROBLEM BETWEEN NUTCH 1.4 VERSION AND SOLR 1.4.1? MAY
 BE
  IT REQUIRE A 3.X SOLR VERSION?
 
  thanks,
  a.
 
  crawl started in: crawl-20120203151719
  rootUrlDir = urls
  threads = 10
  depth = 3
  solrUrl=http://127.0.0.1:8983/solr/
  topN = 5
  Injector: starting at 2012-02-03 15:17:20
  Injector: crawlDb: crawl-20120203151719/crawldb
  Injector: urlDir: urls
  Injector: Converting injected urls to crawl db entries.
  Injector: Merging injected urls into crawl db.
  Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
  Generator: starting at 2012-02-03 15:17:31
  Generator: Selecting best-scoring urls due for fetch.
  Generator: filtering: true
  Generator: normalizing: true
  Generator: topN: 5
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: Partitioning selected urls for politeness.
  Generator: segment: crawl-20120203151719/segments/20120203151735
  Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
  Fetcher: Your 'http.agent.name' value should be listed first in
  'http.robots.agents' property.
  Fetcher: starting at 2012-02-03 15:17:39
  Fetcher: segment: crawl-20120203151719/segments/20120203151735
  Using queue mode : byHost
  Fetcher: threads: 10
  Fetcher: time-out divisor: 2
  QueueFeeder finished: total 1 records + hit by time limit :0
  Using queue mode : byHost
  Using queue mode : byHost
  Using queue mode : byHost
  fetching http://www.gioconews.it/
  Using queue mode : byHost
  -finishing thread FetcherThread, activeThreads=3
  -finishing thread FetcherThread, activeThreads=2
  -finishing thread FetcherThread, activeThreads=1
  Using queue mode : byHost
  Using queue mode : byHost
  -finishing thread FetcherThread, activeThreads=1
  -finishing thread FetcherThread, activeThreads=1
  Using queue mode : byHost
  Using queue mode : byHost
  Using queue mode : byHost
  Using queue mode : byHost
  -finishing thread FetcherThread, activeThreads=1
  Fetcher: throughput threshold: -1
  -finishing thread FetcherThread, activeThreads=1
  Fetcher: throughput threshold retries: 5
  -finishing thread FetcherThread, activeThreads=1
  -finishing thread FetcherThread, activeThreads=1
  fetch of http://www.gioconews.it/ failed with:
  java.net.UnknownHostException: www.gioconews.it
  -finishing thread FetcherThread, activeThreads=0
  -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
  -activeThreads=0
  Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
  ParseSegment: starting at 2012-02-03 15:17:44
  ParseSegment: segment: crawl-20120203151719/segments/20120203151735
  ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
  CrawlDb update: starting at 2012-02-03 15:17:48
  CrawlDb update: db: crawl-20120203151719/crawldb
  CrawlDb update: segments: [crawl-20120203151719/segments/20120203151735]
  CrawlDb update: additions allowed: true
  CrawlDb update: URL normalizing: true
  CrawlDb update: URL filtering: true
  CrawlDb update: 404 purging: false
  CrawlDb update: Merging segment data into db.
  CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
  Generator: starting at 2012-02-03 15:17:53
  Generator: Selecting best-scoring urls due for fetch.
  Generator: filtering: true
  Generator: normalizing: true
  Generator: topN: 5
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: 0 records selected for fetching, exiting ...
  Stopping at depth=1 - no more URLs to fetch.
  LinkDb: starting at 2012-02-03 15:17:57
  LinkDb: linkdb: crawl-20120203151719/linkdb
  LinkDb: URL normalize: true
  LinkDb: URL filter: true
  LinkDb: adding segment:
 
 
 file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203151719/segments/20120203151735
  LinkDb: finished at 2012-02-03 15:18:01, elapsed: 00:00:04
  SolrIndexer: starting at 2012-02-03 15:18:01
  java.lang.RuntimeException: Invalid version (expected 2, but 1) or the
 data
  in not in 'javabin' format
  SolrDeleteDuplicates: starting at 2012-02-03 15:18:09
  SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
  Exception in thread main java.io.IOException:
  org.apache.solr.client.solrj.SolrServerException: Error executing query
 at
 
 
 

Re: nutch in solr

2012-02-05 Thread Matthew Parker
No, they all don't run on 8983.

Tomcat's default port is 8080.

If you're using the embedded server in Solr, you are using Jetty, which
runs on port 8983.

On Sun, Feb 5, 2012 at 11:54 AM, alessio crisantemi 
alessio.crisant...@gmail.com wrote:

 no, all run on port 8983.
 ..

 2012/2/5 Matthew Parker mpar...@apogeeintegration.com

  Doesn't tomcat run on port 8080, and not port 8983? Or did you change the
  tomcat's default port to 8983?
  On Feb 5, 2012 5:17 AM, alessio crisantemi 
 alessio.crisant...@gmail.com
  
  wrote:
 
   Hi All,
   I have some problems with integration of Nutch in Solr and Tomcat.
  
   I follo Nutch tutorial for integration and now, I can crawl a website:
  all
   works right.
   But It I try the solr integration, I can't indexing on Solr.
  
   follow the nutch output after the command:
   bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN
 5
  
   I read java.lang.RuntimeException: Invalid version (expected 2, but 1)
  or
   the data in not in 'javabin' format
   MAY BE THERE IS A PROBLEM BETWEEN NUTCH 1.4 VERSION AND SOLR 1.4.1? MAY
  BE
   IT REQUIRE A 3.X SOLR VERSION?
  
   thanks,
   a.
  
   crawl started in: crawl-20120203151719
   rootUrlDir = urls
   threads = 10
   depth = 3
   solrUrl=http://127.0.0.1:8983/solr/
   topN = 5
   Injector: starting at 2012-02-03 15:17:20
   Injector: crawlDb: crawl-20120203151719/crawldb
   Injector: urlDir: urls
   Injector: Converting injected urls to crawl db entries.
   Injector: Merging injected urls into crawl db.
   Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
   Generator: starting at 2012-02-03 15:17:31
   Generator: Selecting best-scoring urls due for fetch.
   Generator: filtering: true
   Generator: normalizing: true
   Generator: topN: 5
   Generator: jobtracker is 'local', generating exactly one partition.
   Generator: Partitioning selected urls for politeness.
   Generator: segment: crawl-20120203151719/segments/20120203151735
   Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
   Fetcher: Your 'http.agent.name' value should be listed first in
   'http.robots.agents' property.
   Fetcher: starting at 2012-02-03 15:17:39
   Fetcher: segment: crawl-20120203151719/segments/20120203151735
   Using queue mode : byHost
   Fetcher: threads: 10
   Fetcher: time-out divisor: 2
   QueueFeeder finished: total 1 records + hit by time limit :0
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   fetching http://www.gioconews.it/
   Using queue mode : byHost
   -finishing thread FetcherThread, activeThreads=3
   -finishing thread FetcherThread, activeThreads=2
   -finishing thread FetcherThread, activeThreads=1
   Using queue mode : byHost
   Using queue mode : byHost
   -finishing thread FetcherThread, activeThreads=1
   -finishing thread FetcherThread, activeThreads=1
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   -finishing thread FetcherThread, activeThreads=1
   Fetcher: throughput threshold: -1
   -finishing thread FetcherThread, activeThreads=1
   Fetcher: throughput threshold retries: 5
   -finishing thread FetcherThread, activeThreads=1
   -finishing thread FetcherThread, activeThreads=1
   fetch of http://www.gioconews.it/ failed with:
   java.net.UnknownHostException: www.gioconews.it
   -finishing thread FetcherThread, activeThreads=0
   -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
   -activeThreads=0
   Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
   ParseSegment: starting at 2012-02-03 15:17:44
   ParseSegment: segment: crawl-20120203151719/segments/20120203151735
   ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
   CrawlDb update: starting at 2012-02-03 15:17:48
   CrawlDb update: db: crawl-20120203151719/crawldb
   CrawlDb update: segments:
 [crawl-20120203151719/segments/20120203151735]
   CrawlDb update: additions allowed: true
   CrawlDb update: URL normalizing: true
   CrawlDb update: URL filtering: true
   CrawlDb update: 404 purging: false
   CrawlDb update: Merging segment data into db.
   CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
   Generator: starting at 2012-02-03 15:17:53
   Generator: Selecting best-scoring urls due for fetch.
   Generator: filtering: true
   Generator: normalizing: true
   Generator: topN: 5
   Generator: jobtracker is 'local', generating exactly one partition.
   Generator: 0 records selected for fetching, exiting ...
   Stopping at depth=1 - no more URLs to fetch.
   LinkDb: starting at 2012-02-03 15:17:57
   LinkDb: linkdb: crawl-20120203151719/linkdb
   LinkDb: URL normalize: true
   LinkDb: URL filter: true
   LinkDb: adding segment:
  
  
 
 file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203151719/segments/20120203151735
   LinkDb: finished at 2012-02-03 15:18:01, elapsed: 00:00:04
   SolrIndexer: starting at 2012-02-03 15:18:01
   

Re: nutch in solr

2012-02-05 Thread Geek Gamer
looks like the solrj version in the nutch classpath is different from the
solr version on the server;
can you post the versions for both nutch and solr?


On Sun, Feb 5, 2012 at 10:24 PM, alessio crisantemi
alessio.crisant...@gmail.com wrote:
 no, all run on port 8983.
 ..

 2012/2/5 Matthew Parker mpar...@apogeeintegration.com

 Doesn't tomcat run on port 8080, and not port 8983? Or did you change the
 tomcat's default port to 8983?
 On Feb 5, 2012 5:17 AM, alessio crisantemi alessio.crisant...@gmail.com
 
 wrote:

  Hi All,
  I have some problems with integration of Nutch in Solr and Tomcat.
 
  I follo Nutch tutorial for integration and now, I can crawl a website:
 all
  works right.
  But It I try the solr integration, I can't indexing on Solr.
 
  follow the nutch output after the command:
  bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5
 
  I read java.lang.RuntimeException: Invalid version (expected 2, but 1)
 or
  the data in not in 'javabin' format
  MAY BE THERE IS A PROBLEM BETWEEN NUTCH 1.4 VERSION AND SOLR 1.4.1? MAY
 BE
  IT REQUIRE A 3.X SOLR VERSION?
 
  thanks,
  a.
 
  crawl started in: crawl-20120203151719
  rootUrlDir = urls
  threads = 10
  depth = 3
  solrUrl=http://127.0.0.1:8983/solr/
  topN = 5
  Injector: starting at 2012-02-03 15:17:20
  Injector: crawlDb: crawl-20120203151719/crawldb
  Injector: urlDir: urls
  Injector: Converting injected urls to crawl db entries.
  Injector: Merging injected urls into crawl db.
  Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
  Generator: starting at 2012-02-03 15:17:31
  Generator: Selecting best-scoring urls due for fetch.
  Generator: filtering: true
  Generator: normalizing: true
  Generator: topN: 5
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: Partitioning selected urls for politeness.
  Generator: segment: crawl-20120203151719/segments/20120203151735
  Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
  Fetcher: Your 'http.agent.name' value should be listed first in
  'http.robots.agents' property.
  Fetcher: starting at 2012-02-03 15:17:39
  Fetcher: segment: crawl-20120203151719/segments/20120203151735
  Using queue mode : byHost
  Fetcher: threads: 10
  Fetcher: time-out divisor: 2
  QueueFeeder finished: total 1 records + hit by time limit :0
  Using queue mode : byHost
  Using queue mode : byHost
  Using queue mode : byHost
  fetching http://www.gioconews.it/
  Using queue mode : byHost
  -finishing thread FetcherThread, activeThreads=3
  -finishing thread FetcherThread, activeThreads=2
  -finishing thread FetcherThread, activeThreads=1
  Using queue mode : byHost
  Using queue mode : byHost
  -finishing thread FetcherThread, activeThreads=1
  -finishing thread FetcherThread, activeThreads=1
  Using queue mode : byHost
  Using queue mode : byHost
  Using queue mode : byHost
  Using queue mode : byHost
  -finishing thread FetcherThread, activeThreads=1
  Fetcher: throughput threshold: -1
  -finishing thread FetcherThread, activeThreads=1
  Fetcher: throughput threshold retries: 5
  -finishing thread FetcherThread, activeThreads=1
  -finishing thread FetcherThread, activeThreads=1
  fetch of http://www.gioconews.it/ failed with:
  java.net.UnknownHostException: www.gioconews.it
  -finishing thread FetcherThread, activeThreads=0
  -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
  -activeThreads=0
  Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
  ParseSegment: starting at 2012-02-03 15:17:44
  ParseSegment: segment: crawl-20120203151719/segments/20120203151735
  ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
  CrawlDb update: starting at 2012-02-03 15:17:48
  CrawlDb update: db: crawl-20120203151719/crawldb
  CrawlDb update: segments: [crawl-20120203151719/segments/20120203151735]
  CrawlDb update: additions allowed: true
  CrawlDb update: URL normalizing: true
  CrawlDb update: URL filtering: true
  CrawlDb update: 404 purging: false
  CrawlDb update: Merging segment data into db.
  CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
  Generator: starting at 2012-02-03 15:17:53
  Generator: Selecting best-scoring urls due for fetch.
  Generator: filtering: true
  Generator: normalizing: true
  Generator: topN: 5
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: 0 records selected for fetching, exiting ...
  Stopping at depth=1 - no more URLs to fetch.
  LinkDb: starting at 2012-02-03 15:17:57
  LinkDb: linkdb: crawl-20120203151719/linkdb
  LinkDb: URL normalize: true
  LinkDb: URL filter: true
  LinkDb: adding segment:
 
 
 file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203151719/segments/20120203151735
  LinkDb: finished at 2012-02-03 15:18:01, elapsed: 00:00:04
  SolrIndexer: starting at 2012-02-03 15:18:01
  java.lang.RuntimeException: Invalid version (expected 2, but 1) or the
 data
  in not in 'javabin' format
  SolrDeleteDuplicates: starting at 

Re: nutch in solr

2012-02-05 Thread alessio crisantemi
if I look in the solr and nutch libs I find:
apache-solr-solrj-1.4.1.jar on Solr
and
solr-solrj-3.4.0.jar

these are the only jar files with the word 'solrj'.
That's the problem?!

2012/2/5 Geek Gamer geek4...@gmail.com

 looks like solrj version in nutch classpath is different that the solr
 version on server,
 can you  post the versions for both nutch and solr?


 On Sun, Feb 5, 2012 at 10:24 PM, alessio crisantemi
 alessio.crisant...@gmail.com wrote:
  no, all run on port 8983.
  ..
 
  2012/2/5 Matthew Parker mpar...@apogeeintegration.com
 
  Doesn't tomcat run on port 8080, and not port 8983? Or did you change
 the
  tomcat's default port to 8983?
  On Feb 5, 2012 5:17 AM, alessio crisantemi 
 alessio.crisant...@gmail.com
  
  wrote:
 
   Hi All,
   I have some problems with integration of Nutch in Solr and Tomcat.
  
   I follo Nutch tutorial for integration and now, I can crawl a website:
  all
   works right.
   But It I try the solr integration, I can't indexing on Solr.
  
   follow the nutch output after the command:
   bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3
 -topN 5
  
   I read java.lang.RuntimeException: Invalid version (expected 2, but
 1)
  or
   the data in not in 'javabin' format
   MAY BE THERE IS A PROBLEM BETWEEN NUTCH 1.4 VERSION AND SOLR 1.4.1?
 MAY
  BE
   IT REQUIRE A 3.X SOLR VERSION?
  
   thanks,
   a.
  
   crawl started in: crawl-20120203151719
   rootUrlDir = urls
   threads = 10
   depth = 3
   solrUrl=http://127.0.0.1:8983/solr/
   topN = 5
   Injector: starting at 2012-02-03 15:17:20
   Injector: crawlDb: crawl-20120203151719/crawldb
   Injector: urlDir: urls
   Injector: Converting injected urls to crawl db entries.
   Injector: Merging injected urls into crawl db.
   Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
   Generator: starting at 2012-02-03 15:17:31
   Generator: Selecting best-scoring urls due for fetch.
   Generator: filtering: true
   Generator: normalizing: true
   Generator: topN: 5
   Generator: jobtracker is 'local', generating exactly one partition.
   Generator: Partitioning selected urls for politeness.
   Generator: segment: crawl-20120203151719/segments/20120203151735
   Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
   Fetcher: Your 'http.agent.name' value should be listed first in
   'http.robots.agents' property.
   Fetcher: starting at 2012-02-03 15:17:39
   Fetcher: segment: crawl-20120203151719/segments/20120203151735
   Using queue mode : byHost
   Fetcher: threads: 10
   Fetcher: time-out divisor: 2
   QueueFeeder finished: total 1 records + hit by time limit :0
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   fetching http://www.gioconews.it/
   Using queue mode : byHost
   -finishing thread FetcherThread, activeThreads=3
   -finishing thread FetcherThread, activeThreads=2
   -finishing thread FetcherThread, activeThreads=1
   Using queue mode : byHost
   Using queue mode : byHost
   -finishing thread FetcherThread, activeThreads=1
   -finishing thread FetcherThread, activeThreads=1
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   -finishing thread FetcherThread, activeThreads=1
   Fetcher: throughput threshold: -1
   -finishing thread FetcherThread, activeThreads=1
   Fetcher: throughput threshold retries: 5
   -finishing thread FetcherThread, activeThreads=1
   -finishing thread FetcherThread, activeThreads=1
   fetch of http://www.gioconews.it/ failed with:
   java.net.UnknownHostException: www.gioconews.it
   -finishing thread FetcherThread, activeThreads=0
   -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
   -activeThreads=0
   Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
   ParseSegment: starting at 2012-02-03 15:17:44
   ParseSegment: segment: crawl-20120203151719/segments/20120203151735
   ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
   CrawlDb update: starting at 2012-02-03 15:17:48
   CrawlDb update: db: crawl-20120203151719/crawldb
   CrawlDb update: segments:
 [crawl-20120203151719/segments/20120203151735]
   CrawlDb update: additions allowed: true
   CrawlDb update: URL normalizing: true
   CrawlDb update: URL filtering: true
   CrawlDb update: 404 purging: false
   CrawlDb update: Merging segment data into db.
   CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
   Generator: starting at 2012-02-03 15:17:53
   Generator: Selecting best-scoring urls due for fetch.
   Generator: filtering: true
   Generator: normalizing: true
   Generator: topN: 5
   Generator: jobtracker is 'local', generating exactly one partition.
   Generator: 0 records selected for fetching, exiting ...
   Stopping at depth=1 - no more URLs to fetch.
   LinkDb: starting at 2012-02-03 15:17:57
   LinkDb: linkdb: crawl-20120203151719/linkdb
   LinkDb: URL normalize: true
   LinkDb: URL filter: true
   LinkDb: adding segment:
  
  
 
 

Re: nutch in solr

2012-02-05 Thread Geek Gamer
solrj is the solr java client library,

so there seem to be two versions, 1.4.1 and 3.4.0, which are
incompatible, so you can do the following,

refer to:
https://github.com/geek4377/nutch/commit/c66bf35ff4f86393413621b3b889b1c78281df4d

to see how to upgrade the solr version in nutch; the above example
replaces solr 1.4.0 with 3.1.0.
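
I have not inspected that commit line by line, but the manual equivalent is
roughly the following (jar names and paths below are examples; match them to
the versions you actually have; the point is that the solrj jar on the nutch
side must match the solr server's major version):

  cp /path/to/matching/solr-solrj-X.Y.Z.jar $NUTCH_HOME/lib/
  rm $NUTCH_HOME/lib/<old solrj jar>
  cd $NUTCH_HOME && ant    # rebuild so the job picks up the new lib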




On Sun, Feb 5, 2012 at 11:02 PM, alessio crisantemi
alessio.crisant...@gmail.com wrote:
 if I look the solr and nuth libs I found:
 apache-solr-solrj-1.4.1.jar on Solr
 and
 solr-solrj-3.4.0.jar

 this are the only jar files with a word 'solrj'
 taht's the problem?!

 2012/2/5 Geek Gamer geek4...@gmail.com

 looks like solrj version in nutch classpath is different that the solr
 version on server,
 can you  post the versions for both nutch and solr?


 On Sun, Feb 5, 2012 at 10:24 PM, alessio crisantemi
 alessio.crisant...@gmail.com wrote:
  no, all run on port 8983.
  ..
 
  2012/2/5 Matthew Parker mpar...@apogeeintegration.com
 
  Doesn't tomcat run on port 8080, and not port 8983? Or did you change
 the
  tomcat's default port to 8983?
  On Feb 5, 2012 5:17 AM, alessio crisantemi 
 alessio.crisant...@gmail.com
  
  wrote:
 
   Hi All,
   I have some problems with integration of Nutch in Solr and Tomcat.
  
   I follo Nutch tutorial for integration and now, I can crawl a website:
  all
   works right.
   But It I try the solr integration, I can't indexing on Solr.
  
   follow the nutch output after the command:
   bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3
 -topN 5
  
   I read java.lang.RuntimeException: Invalid version (expected 2, but
 1)
  or
   the data in not in 'javabin' format
   MAY BE THERE IS A PROBLEM BETWEEN NUTCH 1.4 VERSION AND SOLR 1.4.1?
 MAY
  BE
   IT REQUIRE A 3.X SOLR VERSION?
  
   thanks,
   a.
  
   crawl started in: crawl-20120203151719
   rootUrlDir = urls
   threads = 10
   depth = 3
   solrUrl=http://127.0.0.1:8983/solr/
   topN = 5
   Injector: starting at 2012-02-03 15:17:20
   Injector: crawlDb: crawl-20120203151719/crawldb
   Injector: urlDir: urls
   Injector: Converting injected urls to crawl db entries.
   Injector: Merging injected urls into crawl db.
   Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
   Generator: starting at 2012-02-03 15:17:31
   Generator: Selecting best-scoring urls due for fetch.
   Generator: filtering: true
   Generator: normalizing: true
   Generator: topN: 5
   Generator: jobtracker is 'local', generating exactly one partition.
   Generator: Partitioning selected urls for politeness.
   Generator: segment: crawl-20120203151719/segments/20120203151735
   Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
   Fetcher: Your 'http.agent.name' value should be listed first in
   'http.robots.agents' property.
   Fetcher: starting at 2012-02-03 15:17:39
   Fetcher: segment: crawl-20120203151719/segments/20120203151735
   Using queue mode : byHost
   Fetcher: threads: 10
   Fetcher: time-out divisor: 2
   QueueFeeder finished: total 1 records + hit by time limit :0
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   fetching http://www.gioconews.it/
   Using queue mode : byHost
   -finishing thread FetcherThread, activeThreads=3
   -finishing thread FetcherThread, activeThreads=2
   -finishing thread FetcherThread, activeThreads=1
   Using queue mode : byHost
   Using queue mode : byHost
   -finishing thread FetcherThread, activeThreads=1
   -finishing thread FetcherThread, activeThreads=1
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   Using queue mode : byHost
   -finishing thread FetcherThread, activeThreads=1
   Fetcher: throughput threshold: -1
   -finishing thread FetcherThread, activeThreads=1
   Fetcher: throughput threshold retries: 5
   -finishing thread FetcherThread, activeThreads=1
   -finishing thread FetcherThread, activeThreads=1
   fetch of http://www.gioconews.it/ failed with:
   java.net.UnknownHostException: www.gioconews.it
   -finishing thread FetcherThread, activeThreads=0
   -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
   -activeThreads=0
   Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
   ParseSegment: starting at 2012-02-03 15:17:44
   ParseSegment: segment: crawl-20120203151719/segments/20120203151735
   ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
   CrawlDb update: starting at 2012-02-03 15:17:48
   CrawlDb update: db: crawl-20120203151719/crawldb
   CrawlDb update: segments:
 [crawl-20120203151719/segments/20120203151735]
   CrawlDb update: additions allowed: true
   CrawlDb update: URL normalizing: true
   CrawlDb update: URL filtering: true
   CrawlDb update: 404 purging: false
   CrawlDb update: Merging segment data into db.
   CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
   Generator: starting at 2012-02-03 15:17:53
   Generator: Selecting best-scoring urls due for fetch.
   Generator: 

Re: nutch in solr

2012-02-05 Thread alessio crisantemi
tx, I'll try it and write the result asap
a.

2012/2/5 Geek Gamer geek4...@gmail.com

 solj is the solr java client library,

 so there seem to be two versions 1.4.1 and 3.4.0, which are
 incompatible,  so you can do the following,

 refer :
 https://github.com/geek4377/nutch/commit/c66bf35ff4f86393413621b3b889b1c78281df4d

 to see how to upgrade the solr version in nutch, teh above example
 replaces solr 1.4.0 with 3.1.0.




 On Sun, Feb 5, 2012 at 11:02 PM, alessio crisantemi
 alessio.crisant...@gmail.com wrote:
  if I look the solr and nuth libs I found:
  apache-solr-solrj-1.4.1.jar on Solr
  and
  solr-solrj-3.4.0.jar
 
  this are the only jar files with a word 'solrj'
  taht's the problem?!
 
  2012/2/5 Geek Gamer geek4...@gmail.com
 
  looks like solrj version in nutch classpath is different that the solr
  version on server,
  can you  post the versions for both nutch and solr?
 
 
  On Sun, Feb 5, 2012 at 10:24 PM, alessio crisantemi
  alessio.crisant...@gmail.com wrote:
   no, all run on port 8983.
   ..
  
   2012/2/5 Matthew Parker mpar...@apogeeintegration.com
  
   Doesn't tomcat run on port 8080, and not port 8983? Or did you change
  the
   tomcat's default port to 8983?
   On Feb 5, 2012 5:17 AM, alessio crisantemi 
  alessio.crisant...@gmail.com
   
   wrote:
  
Hi All,
I have some problems with integration of Nutch in Solr and Tomcat.
   
I follo Nutch tutorial for integration and now, I can crawl a
 website:
   all
works right.
But It I try the solr integration, I can't indexing on Solr.
   
follow the nutch output after the command:
bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3
  -topN 5
   
I read java.lang.RuntimeException: Invalid version (expected 2,
 but
  1)
   or
the data in not in 'javabin' format
MAY BE THERE IS A PROBLEM BETWEEN NUTCH 1.4 VERSION AND SOLR 1.4.1?
  MAY
   BE
IT REQUIRE A 3.X SOLR VERSION?
   
thanks,
a.
   
crawl started in: crawl-20120203151719
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=http://127.0.0.1:8983/solr/
topN = 5
Injector: starting at 2012-02-03 15:17:20
Injector: crawlDb: crawl-20120203151719/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
Generator: starting at 2012-02-03 15:17:31
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20120203151719/segments/20120203151735
Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2012-02-03 15:17:39
Fetcher: segment: crawl-20120203151719/segments/20120203151735
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.gioconews.it/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
fetch of http://www.gioconews.it/ failed with:
java.net.UnknownHostException: www.gioconews.it
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
ParseSegment: starting at 2012-02-03 15:17:44
ParseSegment: segment: crawl-20120203151719/segments/20120203151735
ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
CrawlDb update: starting at 2012-02-03 15:17:48
CrawlDb update: db: crawl-20120203151719/crawldb
CrawlDb update: segments:
  [crawl-20120203151719/segments/20120203151735]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging 

Re: nutch 1.2, solr 3.3, tomcat6. java.io.IOException: Job failed! problem when building solrindex

2011-07-13 Thread Geek Gamer
you need to update the solrj libs to the 3.x version. the javabin format
has changed.
I made the change a few months back; you can pull the changes from
https://github.com/geek4377/nutch/tree/geek5377-1.2.1
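
For example (a sketch; the branch name is taken from the URL above, and
plain 'ant' assumes the standard nutch 1.2 build):

  git clone https://github.com/geek4377/nutch.git
  cd nutch
  git checkout geek5377-1.2.1
  ant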

hope that helps,


On Wed, Jul 13, 2011 at 8:58 AM, Leo Subscriptions
llsub...@zudiewiener.com wrote:
 I'm running 64bit Ubuntu 11.04, nutch 1.2, solr 3.3 (downloaded, not
 built) and tomcat6, following this (and some other) links:
 http://wiki.apache.org/nutch/RunningNutchAndSolr

 I have added the nutch schema and can access/view this schema via the
 admin page. nutch also works, as I can perform successful searches.

 When I execute the following:

 ./bin/nutch solrindex http://localhost:8080/solr/core0 crawl/crawldb
 crawl/linkdb crawl/segments/*

 I (eventually) get an io error.

 The above command creates the following
 files in /var/lib/tomcat6/solr/core0/data/index/

 ---
 544 -rw-r--r-- 1 tomcat6 tomcat6 557056 2011-07-13 11:09 _1.fdt
  0 -rw-r--r-- 1 tomcat6 tomcat6      0 2011-07-13 11:00 _1.fdx
  4 -rw-r--r-- 1 tomcat6 tomcat6     32 2011-07-13 10:59 segments_2
  4 -rw-r--r-- 1 tomcat6 tomcat6     20 2011-07-13 10:59 segments.gen
  0 -rw-r--r-- 1 tomcat6 tomcat6      0 2011-07-13 11:00 write.lock
 ---

 but the hadoop.log reports the following error

 ---
 2011-07-13 11:09:47,665 INFO  indexer.IndexingFilters - Adding
 org.apache.nutch.indexer.basic.BasicIndexingFilter
 2011-07-13 11:09:47,666 INFO  indexer.IndexingFilters - Adding
 org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: content
 dest: content
 2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: site
 dest: site
 2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: title
 dest: title
 2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: host
 dest: host
 2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: segment
 dest: segment
 2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: boost
 dest: boost
 2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: digest
 dest: digest
 2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: tstamp
 dest: tstamp
 2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: url dest:
 id
 2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: url dest:
 url
 2011-07-13 11:09:49,272 WARN  mapred.LocalJobRunner - job_local_0001
 java.lang.RuntimeException: Invalid version or the data in not in
 'javabin' format
        at
 org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
        at
 org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
        at
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
        at
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
        at
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at
 org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
        at
 org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:64)
        at org.apache.nutch.indexer.IndexerOutputFormat
 $1.write(IndexerOutputFormat.java:54)
        at org.apache.nutch.indexer.IndexerOutputFormat
 $1.write(IndexerOutputFormat.java:44)
        at org.apache.hadoop.mapred.ReduceTask
 $3.collect(ReduceTask.java:440)
        at
 org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:159)
        at
 org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
        at
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.LocalJobRunner
 $Job.run(LocalJobRunner.java:216)
 2011-07-13 11:09:49,611 ERROR solr.SolrIndexer - java.io.IOException:
 Job failed!
 ---

 I'd appreciate any help with this.

 Thanks,

 Leo






Re: nutch 1.2, solr 3.3, tomcat6. java.io.IOException: Job failed! problem when building solrindex

2011-07-13 Thread Leo Subscriptions
Works like a charm.

Thanks,

Leo

On Wed, 2011-07-13 at 11:31 +0530, Geek Gamer wrote:

 you need to update the solrj libs to 3.x version. the java bin format
 has changed .
 I made the change a few months back, you can pull the changes from
 https://github.com/geek4377/nutch/tree/geek5377-1.2.1
 
 hope that helps,
 
 
 On Wed, Jul 13, 2011 at 8:58 AM, Leo Subscriptions
 llsub...@zudiewiener.com wrote:
  I'm running 64bit Ubuntu 11.04, nutch 1.2, solr 3.3 (downloaded, not
  built) and tomcat6 following this (and some other) links
  http://wiki.apache.org/nutch/RunningNutchAndSolr
 
  I have added the nutch schema and can access/view this schema via the
  admin page. nutch also works as I can perfrom successful searches.
 
  When I execute the following:
 
  ./bin/nutch solrindex http://localhost:8080/solr/core0 crawl/crawldb
  crawl/linkdb crawl/segments/*
 
  I (eventually) get an io error.
 
  Tha above command creates the following
  files /var/lib/tomcat6/solr/core0/data/index/
 
  ---
  544 -rw-r--r-- 1 tomcat6 tomcat6 557056 2011-07-13 11:09 _1.fdt
   0 -rw-r--r-- 1 tomcat6 tomcat6  0 2011-07-13 11:00 _1.fdx
   4 -rw-r--r-- 1 tomcat6 tomcat6 32 2011-07-13 10:59 segments_2
   4 -rw-r--r-- 1 tomcat6 tomcat6 20 2011-07-13 10:59 segments.gen
   0 -rw-r--r-- 1 tomcat6 tomcat6  0 2011-07-13 11:00 write.lock
  ---
 
  but the hadoop.log reports the following error
 
  ---
  2011-07-13 11:09:47,665 INFO  indexer.IndexingFilters - Adding
  org.apache.nutch.indexer.basic.BasicIndexingFilter
  2011-07-13 11:09:47,666 INFO  indexer.IndexingFilters - Adding
  org.apache.nutch.indexer.anchor.AnchorIndexingFilter
  2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: content
  dest: content
  2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: site
  dest: site
  2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: title
  dest: title
  2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: host
  dest: host
  2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: segment
  dest: segment
  2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: boost
  dest: boost
  2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: digest
  dest: digest
  2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: tstamp
  dest: tstamp
  2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: url dest:
  id
  2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: url dest:
  url
  2011-07-13 11:09:49,272 WARN  mapred.LocalJobRunner - job_local_0001
  java.lang.RuntimeException: Invalid version or the data in not in
  'javabin' format
 at
  org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
 at
  org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
 at
  org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
 at
  org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
 at
  org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 at
  org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
 at
  org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:64)
 at org.apache.nutch.indexer.IndexerOutputFormat
  $1.write(IndexerOutputFormat.java:54)
 at org.apache.nutch.indexer.IndexerOutputFormat
  $1.write(IndexerOutputFormat.java:44)
 at org.apache.hadoop.mapred.ReduceTask
  $3.collect(ReduceTask.java:440)
 at
  org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:159)
 at
  org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
 at
  org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
 at org.apache.hadoop.mapred.LocalJobRunner
  $Job.run(LocalJobRunner.java:216)
  2011-07-13 11:09:49,611 ERROR solr.SolrIndexer - java.io.IOException:
  Job failed!
  ---
 
  I'd appreciate any help with this.
 
  Thanks,
 
  Leo
 
 
 
 




Re: nutch 1.2, solr 3.3, tomcat6. java.io.IOException: Job failed! problem when building solrindex

2011-07-13 Thread Markus Jelsma
If you're using Solr anyway, you'd better upgrade to Nutch 1.3 with Solr 3.x 
support.
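
With 1.3 the whole crawl-and-index run can then be a one-liner (a sketch, 
assuming a local Solr on port 8983 and the crawl command's -solr option; 
adjust the URL, depth and topN to your setup):

  # crawl and push the parsed pages straight into Solr in one go
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -solr http://localhost:8983/solr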

 Works like a charm.
 
 Thanks,
 
 Leo
 
 On Wed, 2011-07-13 at 11:31 +0530, Geek Gamer wrote:
  you need to update the solrj libs to 3.x version. the java bin format
  has changed .
  I made the change a few months back, you can pull the changes from
  https://github.com/geek4377/nutch/tree/geek5377-1.2.1
  
  hope that helps,
  
  
  On Wed, Jul 13, 2011 at 8:58 AM, Leo Subscriptions
  
  llsub...@zudiewiener.com wrote:
   I'm running 64bit Ubuntu 11.04, nutch 1.2, solr 3.3 (downloaded, not
   built) and tomcat6 following this (and some other) links
   http://wiki.apache.org/nutch/RunningNutchAndSolr
   
   I have added the nutch schema and can access/view this schema via the
   admin page. Nutch also works as I can perform successful searches.
   
   When I execute the following:
   ./bin/nutch solrindex http://localhost:8080/solr/core0 crawl/crawldb
   
   crawl/linkdb crawl/segments/*
   
   I (eventually) get an io error.
   
   The above command creates the following
   files /var/lib/tomcat6/solr/core0/data/index/
   
   ---
   544 -rw-r--r-- 1 tomcat6 tomcat6 557056 2011-07-13 11:09 _1.fdt
   
0 -rw-r--r-- 1 tomcat6 tomcat6  0 2011-07-13 11:00 _1.fdx
4 -rw-r--r-- 1 tomcat6 tomcat6 32 2011-07-13 10:59 segments_2
4 -rw-r--r-- 1 tomcat6 tomcat6 20 2011-07-13 10:59 segments.gen
0 -rw-r--r-- 1 tomcat6 tomcat6  0 2011-07-13 11:00 write.lock
   
   ---
   
   but the hadoop.log reports the following error
   
   ---
   2011-07-13 11:09:47,665 INFO  indexer.IndexingFilters - Adding
   org.apache.nutch.indexer.basic.BasicIndexingFilter
   2011-07-13 11:09:47,666 INFO  indexer.IndexingFilters - Adding
   org.apache.nutch.indexer.anchor.AnchorIndexingFilter
   2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: content
   dest: content
   2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: site
   dest: site
   2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: title
   dest: title
   2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: host
   dest: host
   2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: segment
   dest: segment
   2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: boost
   dest: boost
   2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: digest
   dest: digest
   2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: tstamp
   dest: tstamp
   2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: url
   dest: id
   2011-07-13 11:09:47,690 INFO  solr.SolrMappingReader - source: url
   dest: url
   2011-07-13 11:09:49,272 WARN  mapred.LocalJobRunner - job_local_0001
   java.lang.RuntimeException: Invalid version or the data in not in
   'javabin' format
   
  at
   
   org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99
   )
   
  at
   
   org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(
   BinaryResponseParser.java:39)
   
  at
   
   org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(Commons
   HttpSolrServer.java:466)
   
  at
   
   org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(Commons
   HttpSolrServer.java:243)
   
  at
   
   org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(Abst
   ractUpdateRequest.java:105)
   
  at
   
   org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
   
  at
   
   org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:64)
   
  at org.apache.nutch.indexer.IndexerOutputFormat
   
   $1.write(IndexerOutputFormat.java:54)
   
  at org.apache.nutch.indexer.IndexerOutputFormat
   
   $1.write(IndexerOutputFormat.java:44)
   
  at org.apache.hadoop.mapred.ReduceTask
   
   $3.collect(ReduceTask.java:440)
   
  at
   
   org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:
   159)
   
  at
   
   org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:
   50)
   
  at
   
   org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
   
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
  at org.apache.hadoop.mapred.LocalJobRunner
   
   $Job.run(LocalJobRunner.java:216)
   2011-07-13 11:09:49,611 ERROR solr.SolrIndexer - java.io.IOException:
   Job failed!
   ---
   ---
   -
   
   I'd appreciate any help with this.
   
   Thanks,
   
   Leo


Re: Nutch and Solr search on the fly

2011-02-09 Thread Markus Jelsma
The parsed data is only sent to the Solr index if you tell a segment to be 
indexed; solrindex crawldb linkdb segment

If you did this only once after injecting, and then ran the consequent 
fetch, parse, update, index sequence, then you of course only see those URLs. 
If you don't index a segment after it has been parsed, you need to do it later 
on.
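
For example, to index an already-parsed segment later on (a sketch; the 
segment name below is a hypothetical placeholder, use an actual directory 
from crawl/segments):

  # push one parsed segment into a running Solr instance
  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb \
    crawl/segments/20110209123456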

On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
 Hi all,
 
  I am a newbie to nutch and solr. Well relatively much newer to Solr than
 Nutch :)
 
  I have been using nutch for past two weeks, and I wanted to know if I can
 query or search on my nutch crawls on the fly(before it completes). I am
 asking this because the websites I am crawling are really huge and it takes
 around 3-4 days for a crawl to complete. I want to analyze some quick
 results while the nutch crawler is still crawling the URLs. Some one
 suggested me that Solr would make it possible.
 
  I followed the steps in
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. By
 this process, I see only the injected URLs are shown in the Solr search. I
 know I did something really foolish and the crawl never happened, I feel I
 am missing some information here. I think somewhere in the process there
 should be a crawling happening and I missed it out.
 
  Just wanted to see if some one could help me pointing this out and where I
 went wrong in the process. Forgive my foolishness and thanks for your
 patience.
 
 Cheers,
 Abi

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Nutch and Solr search on the fly

2011-02-09 Thread .: Abhishek :.
Hi Markus,

 I am sorry for not being clear, I meant to say that...

 Suppose a url, namely www.somehost.com/gifts/greetingcard.html (which in
turn contains links to a.html, b.html, c.html, d.html), is injected into the
seed.txt. After the whole process I was expecting a bunch of other pages
crawled from this seed url. However, at the end of it all I see is the
content from only this page, namely
www.somehost.com/gifts/greetingcard.html, and I do not see any other
pages (here a.html, b.html, c.html, d.html) crawled from this one.

 The crawling happens only for the URLs mentioned in the seed.txt and does
not proceed further from there, so I am just a bit confused. Why is it not
crawling the linked pages (a.html, b.html, c.html and d.html)? I get the
feeling that I am missing something that the author of the blog
(http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
everyone would know.

Thanks,
Abi


On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma markus.jel...@openindex.iowrote:

 The parsed data is only sent to the Solr index if you tell a segment to be
 indexed; solrindex crawldb linkdb segment

 If you did this only once after injecting  and then the consequent
 fetch,parse,update,index sequence then you, of course, only see those
 URL's.
 If you don't index a segment after it's being parsed, you need to do it
 later
 on.

 On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
  Hi all,
 
   I am a newbie to nutch and solr. Well relatively much newer to Solr than
  Nutch :)
 
   I have been using nutch for past two weeks, and I wanted to know if I
 can
  query or search on my nutch crawls on the fly(before it completes). I am
  asking this because the websites I am crawling are really huge and it
 takes
  around 3-4 days for a crawl to complete. I want to analyze some quick
  results while the nutch crawler is still crawling the URLs. Some one
  suggested me that Solr would make it possible.
 
   I followed the steps in
  http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. By
  this process, I see only the injected URLs are shown in the Solr search.
 I
  know I did something really foolish and the crawl never happened, I feel
 I
  am missing some information here. I think somewhere in the process there
  should be a crawling happening and I missed it out.
 
   Just wanted to see if some one could help me pointing this out and where
 I
  went wrong in the process. Forgive my foolishness and thanks for your
  patience.
 
  Cheers,
  Abi

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350



Re: Nutch and Solr search on the fly

2011-02-09 Thread Erick Erickson
WARNING: I don't do Nutch much, but could it be that your
crawl depth is 1? See:
http://wiki.apache.org/nutch/NutchTutorial and search for depth
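
For example (a sketch; the depth and topN values here are arbitrary):

  # -depth controls how many link levels beyond the seeds get crawled
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50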
Best
Erick

On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com wrote:

 Hi Markus,

  I am sorry for not being clear, I meant to say that...

  Suppose if a url namely www.somehost.com/gifts/greetingcard.html(which in
 turn contain links to a.html, b.html, c.html, d.html) is injected into the
 seed.txt, after the whole process I was expecting a bunch of other pages
 which crawled from this seed url. However, at the end of it all I see is
 the
 contents from only this page namely
  www.somehost.com/gifts/greetingcard.html and I do not see any other
 pages(here a.html, b.html, c.html, d.html)
 crawled from this one.

  The crawling happens only for the URLs mentioned in the seed.txt and does
 not proceed further from there. So I am just bit confused. Why is it not
 crawling the linked pages(a.html, b.html, c.html and d.html). I get a
 feeling that I am missing something that the author of the blog(
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
 everyone would know.

 Thanks,
 Abi


 On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma markus.jel...@openindex.io
 wrote:

  The parsed data is only sent to the Solr index if you tell a segment to
 be
  indexed; solrindex crawldb linkdb segment
 
  If you did this only once after injecting  and then the consequent
  fetch,parse,update,index sequence then you, of course, only see those
  URL's.
  If you don't index a segment after it's being parsed, you need to do it
  later
  on.
 
  On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
   Hi all,
  
I am a newbie to nutch and solr. Well relatively much newer to Solr
 than
   Nutch :)
  
I have been using nutch for past two weeks, and I wanted to know if I
  can
   query or search on my nutch crawls on the fly(before it completes). I
 am
   asking this because the websites I am crawling are really huge and it
  takes
   around 3-4 days for a crawl to complete. I want to analyze some quick
   results while the nutch crawler is still crawling the URLs. Some one
   suggested me that Solr would make it possible.
  
I followed the steps in
   http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this.
 By
   this process, I see only the injected URLs are shown in the Solr
 search.
  I
   know I did something really foolish and the crawl never happened, I
 feel
  I
   am missing some information here. I think somewhere in the process
 there
   should be a crawling happening and I missed it out.
  
Just wanted to see if some one could help me pointing this out and
 where
  I
   went wrong in the process. Forgive my foolishness and thanks for your
   patience.
  
   Cheers,
   Abi
 
  --
  Markus Jelsma - CTO - Openindex
  http://www.linkedin.com/in/markus17
  050-8536620 / 06-50258350
 



Re: Nutch and Solr search on the fly

2011-02-09 Thread Markus Jelsma
Are you using the depth parameter with the crawl command or are you using the 
separate generate, fetch etc. commands?

What's $  nutch readdb crawldb -stats returning?
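
That is (a sketch, assuming your crawldb lives under crawl/):

  # prints the URL total and per-status counts for the crawldb
  bin/nutch readdb crawl/crawldb -stats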

On Wednesday 09 February 2011 15:06:40 .: Abhishek :. wrote:
 Hi Markus,
 
  I am sorry for not being clear, I meant to say that...
 
  Suppose if a url namely www.somehost.com/gifts/greetingcard.html(which in
 turn contain links to a.html, b.html, c.html, d.html) is injected into the
 seed.txt, after the whole process I was expecting a bunch of other pages
 which crawled from this seed url. However, at the end of it all I see is
 the contents from only this page namely
 www.somehost.com/gifts/greetingcard.htmland I do not see any other
 pages(here a.html, b.html, c.html, d.html)
 crawled from this one.
 
  The crawling happens only for the URLs mentioned in the seed.txt and does
 not proceed further from there. So I am just bit confused. Why is it not
 crawling the linked pages(a.html, b.html, c.html and d.html). I get a
 feeling that I am missing something that the author of the blog(
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
 everyone would know.
 
 Thanks,
 Abi
 
 On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma 
markus.jel...@openindex.iowrote:
  The parsed data is only sent to the Solr index if you tell a segment to
  be indexed; solrindex crawldb linkdb segment
  
  If you did this only once after injecting  and then the consequent
  fetch,parse,update,index sequence then you, of course, only see those
  URL's.
  If you don't index a segment after it's being parsed, you need to do it
  later
  on.
  
  On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
   Hi all,
   
I am a newbie to nutch and solr. Well relatively much newer to Solr
than
   
   Nutch :)
   
I have been using nutch for past two weeks, and I wanted to know if I
  
  can
  
   query or search on my nutch crawls on the fly(before it completes). I
   am asking this because the websites I am crawling are really huge and
   it
  
  takes
  
   around 3-4 days for a crawl to complete. I want to analyze some quick
   results while the nutch crawler is still crawling the URLs. Some one
   suggested me that Solr would make it possible.
   
I followed the steps in
   
   http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this.
   By this process, I see only the injected URLs are shown in the Solr
   search.
  
  I
  
   know I did something really foolish and the crawl never happened, I
   feel
  
  I
  
   am missing some information here. I think somewhere in the process
   there should be a crawling happening and I missed it out.
   
Just wanted to see if some one could help me pointing this out and
where
  
  I
  
   went wrong in the process. Forgive my foolishness and thanks for your
   patience.
   
   Cheers,
   Abi
  
  --
  Markus Jelsma - CTO - Openindex
  http://www.linkedin.com/in/markus17
  050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Nutch and Solr search on the fly

2011-02-09 Thread .: Abhishek :.
Hi Erick,

 Thanks a bunch for the response

 Could be the case... but all I am wondering is where to specify the depth in
the whole process described at
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried
specifying it during the fetcher phase but it was just ignored :(

Thanks,
Abi

On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson erickerick...@gmail.comwrote:

 WARNING: I don't do Nutch much, but could it be that your
 crawl depth is 1? See:
 http://wiki.apache.org/nutch/NutchTutorial

 http://wiki.apache.org/nutch/NutchTutorialand search for depth
 Best
 Erick

 On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com wrote:

  Hi Markus,
 
   I am sorry for not being clear, I meant to say that...
 
   Suppose if a url namely www.somehost.com/gifts/greetingcard.html (which in
  turn contain links to a.html, b.html, c.html, d.html) is injected into
 the
  seed.txt, after the whole process I was expecting a bunch of other pages
  which crawled from this seed url. However, at the end of it all I see is
  the
  contents from only this page namely
   www.somehost.com/gifts/greetingcard.html and I do not see any other
  pages(here a.html, b.html, c.html, d.html)
  crawled from this one.
 
   The crawling happens only for the URLs mentioned in the seed.txt and
 does
  not proceed further from there. So I am just bit confused. Why is it not
  crawling the linked pages(a.html, b.html, c.html and d.html). I get a
  feeling that I am missing something that the author of the blog(
  http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
  everyone would know.
 
  Thanks,
  Abi
 
 
  On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma 
 markus.jel...@openindex.io
  wrote:
 
    The parsed data is only sent to the Solr index if you tell a segment to
  be
   indexed; solrindex crawldb linkdb segment
  
   If you did this only once after injecting  and then the consequent
   fetch,parse,update,index sequence then you, of course, only see those
   URL's.
   If you don't index a segment after it's being parsed, you need to do it
   later
   on.
  
   On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
Hi all,
   
 I am a newbie to nutch and solr. Well relatively much newer to Solr
  than
Nutch :)
   
 I have been using nutch for past two weeks, and I wanted to know if
 I
   can
query or search on my nutch crawls on the fly(before it completes). I
  am
asking this because the websites I am crawling are really huge and it
   takes
around 3-4 days for a crawl to complete. I want to analyze some quick
results while the nutch crawler is still crawling the URLs. Some one
suggested me that Solr would make it possible.
   
 I followed the steps in
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for
 this.
  By
this process, I see only the injected URLs are shown in the Solr
  search.
   I
know I did something really foolish and the crawl never happened, I
  feel
   I
am missing some information here. I think somewhere in the process
  there
should be a crawling happening and I missed it out.
   
 Just wanted to see if some one could help me pointing this out and
  where
   I
went wrong in the process. Forgive my foolishness and thanks for your
patience.
   
Cheers,
Abi
  
   --
   Markus Jelsma - CTO - Openindex
   http://www.linkedin.com/in/markus17
   050-8536620 / 06-50258350
  
 



Re: Nutch and Solr search on the fly

2011-02-09 Thread charan kumar
Hi Abishek,

depth is a param of the crawl command, not the fetch command.

If you are using a custom script calling the individual stages of a nutch
crawl, then depth N means running that script N times. You can put a
loop in the script.
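
Something like this (a sketch of one round per iteration; the paths, topN 
value and iteration count are placeholders):

  # run N generate/fetch/parse/updatedb rounds instead of crawl -depth N
  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 50
    segment=`ls -d crawl/segments/* | tail -1`   # the segment just generated
    bin/nutch fetch $segment
    bin/nutch parse $segment
    bin/nutch updatedb crawl/crawldb $segment
  done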

Thanks,
Charan

On Wed, Feb 9, 2011 at 6:26 AM, .: Abhishek :. ab1s...@gmail.com wrote:

 Hi Erick,

  Thanks a bunch for the response

  Could be a chance..but all I am wondering is where to specify the depth in
 the whole entire process in the URL
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried
 specifying it during the fetcher phase but it was just ignored :(

 Thanks,
 Abi

 On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  WARNING: I don't do Nutch much, but could it be that your
  crawl depth is 1? See:
  http://wiki.apache.org/nutch/NutchTutorial
 
  http://wiki.apache.org/nutch/NutchTutorialand search for depth
  Best
  Erick
 
  On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com
 wrote:
 
   Hi Markus,
  
I am sorry for not being clear, I meant to say that...
  
    Suppose if a url namely www.somehost.com/gifts/greetingcard.html (which in
   turn contain links to a.html, b.html, c.html, d.html) is injected into
  the
   seed.txt, after the whole process I was expecting a bunch of other
 pages
   which crawled from this seed url. However, at the end of it all I see
 is
   the
   contents from only this page namely
    www.somehost.com/gifts/greetingcard.html and I do not see any other
   pages(here a.html, b.html, c.html, d.html)
   crawled from this one.
  
The crawling happens only for the URLs mentioned in the seed.txt and
  does
   not proceed further from there. So I am just bit confused. Why is it
 not
   crawling the linked pages(a.html, b.html, c.html and d.html). I get a
   feeling that I am missing something that the author of the blog(
   http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
   everyone would know.
  
   Thanks,
   Abi
  
  
   On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma 
  markus.jel...@openindex.io
   wrote:
  
 The parsed data is only sent to the Solr index if you tell a segment
 to
   be
indexed; solrindex crawldb linkdb segment
   
If you did this only once after injecting  and then the consequent
fetch,parse,update,index sequence then you, of course, only see those
URL's.
If you don't index a segment after it's being parsed, you need to do
 it
later
on.
   
On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
 Hi all,

  I am a newbie to nutch and solr. Well relatively much newer to
 Solr
   than
 Nutch :)

  I have been using nutch for past two weeks, and I wanted to know
 if
  I
can
 query or search on my nutch crawls on the fly(before it completes).
 I
   am
 asking this because the websites I am crawling are really huge and
 it
takes
 around 3-4 days for a crawl to complete. I want to analyze some
 quick
 results while the nutch crawler is still crawling the URLs. Some
 one
 suggested me that Solr would make it possible.

  I followed the steps in
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for
  this.
   By
 this process, I see only the injected URLs are shown in the Solr
   search.
I
 know I did something really foolish and the crawl never happened, I
   feel
I
 am missing some information here. I think somewhere in the process
   there
 should be a crawling happening and I missed it out.

  Just wanted to see if some one could help me pointing this out and
   where
I
 went wrong in the process. Forgive my foolishness and thanks for
 your
 patience.

 Cheers,
 Abi
   
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
   
  
 



Re: Nutch and Solr search on the fly

2011-02-09 Thread .: Abhishek :.
Hi Charan,

 Thanks for the clarifications.

 The link I have been referring to
(http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) does not say
anything about using the crawl command. Do I have to run it after the last step
mentioned?

Thanks,
Abi

On Thu, Feb 10, 2011 at 12:58 AM, charan kumar charan.ku...@gmail.comwrote:

 Hi Abishek,

 depth is a param of the crawl command, not the fetch command.

 If you are using a custom script calling the individual stages of a nutch
 crawl, then depth N means running that script N times. You can put a
 loop in the script.

 Thanks,
 Charan

 On Wed, Feb 9, 2011 at 6:26 AM, .: Abhishek :. ab1s...@gmail.com wrote:

  Hi Erick,
 
   Thanks a bunch for the response
 
   Could be a chance..but all I am wondering is where to specify the depth
 in
  the whole entire process in the URL
  http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried
  specifying it during the fetcher phase but it was just ignored :(
 
  Thanks,
  Abi
 
  On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   WARNING: I don't do Nutch much, but could it be that your
   crawl depth is 1? See:
   http://wiki.apache.org/nutch/NutchTutorial
  
   http://wiki.apache.org/nutch/NutchTutorialand search for depth
   Best
   Erick
  
   On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com
  wrote:
  
Hi Markus,
   
 I am sorry for not being clear, I meant to say that...
   
  Suppose if a url namely www.somehost.com/gifts/greetingcard.html (which in
turn contain links to a.html, b.html, c.html, d.html) is injected
 into
   the
seed.txt, after the whole process I was expecting a bunch of other
  pages
which crawled from this seed url. However, at the end of it all I see
  is
the
contents from only this page namely
    www.somehost.com/gifts/greetingcard.html and I do not see any other
pages(here a.html, b.html, c.html, d.html)
crawled from this one.
   
 The crawling happens only for the URLs mentioned in the seed.txt and
   does
not proceed further from there. So I am just bit confused. Why is it
  not
crawling the linked pages(a.html, b.html, c.html and d.html). I get a
feeling that I am missing something that the author of the blog(
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
everyone would know.
   
Thanks,
Abi
   
   
On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma 
   markus.jel...@openindex.io
wrote:
   
  The parsed data is only sent to the Solr index if you tell a
 segment
  to
be
 indexed; solrindex crawldb linkdb segment

 If you did this only once after injecting  and then the consequent
 fetch,parse,update,index sequence then you, of course, only see
 those
 URL's.
 If you don't index a segment after it's being parsed, you need to
 do
  it
 later
 on.

 On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
  Hi all,
 
   I am a newbie to nutch and solr. Well relatively much newer to
  Solr
than
  Nutch :)
 
   I have been using nutch for past two weeks, and I wanted to know
  if
   I
 can
  query or search on my nutch crawls on the fly(before it
 completes).
  I
am
  asking this because the websites I am crawling are really huge
 and
  it
 takes
  around 3-4 days for a crawl to complete. I want to analyze some
  quick
  results while the nutch crawler is still crawling the URLs. Some
  one
  suggested me that Solr would make it possible.
 
   I followed the steps in
  http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for
   this.
By
  this process, I see only the injected URLs are shown in the Solr
search.
 I
  know I did something really foolish and the crawl never happened,
 I
feel
 I
  am missing some information here. I think somewhere in the
 process
there
  should be a crawling happening and I missed it out.
 
   Just wanted to see if some one could help me pointing this out
 and
where
 I
  went wrong in the process. Forgive my foolishness and thanks for
  your
  patience.
 
  Cheers,
  Abi

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350

   
  
 



Re: [Nutch] and Solr integration

2011-01-03 Thread Adam Estrada
All,

I realize that the documentation says that you crawl first then add to Solr
but I spent several hours running the same command through Cygwin with
-solrindex http://localhost:8983/solr on the command line (eg. bin/nutch
crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex
http://localhost:8983/solr) and it worked. Does anyone know why it's not
working for me anymore? I am using the Lucid build of Solr which was what i
was using before. I neglected to write down the command line syntax which is
biting me in the arse. Any tips on this one would be great!

Thanks,
Adam

On Mon, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote:


 Why are you using solrindex in the argument? It is used when we need to index
 the crawled data in Solr.
 For more read http://wiki.apache.org/nutch/NutchTutorial .

 Also for nutch-solr integration this is very useful blog
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
 I integrated nutch and solr and it works well.

 Thanks

 On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene] wrote:

  All,
 
  I have a couple websites that I need to crawl and the following command
  line
  used to work I think. Solr is up and running and everything is fine there
  and I can go through and index the site but I really need the results
 added
 
  to Solr after the crawl. Does anyone have any idea on how to make that
  happen or what I'm doing wrong?  These errors are being thrown from Hadoop
  which I am not using at all.
 
  $ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50
  -solrindex
  http://localhost:8983/solr
  crawl started in: crawl
  rootUrlDir = http://localhost:8983/solr
  threads = 10
  depth = 100
  indexer=lucene
  topN = 50
  Injector: starting at 2010-12-20 15:23:25
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: http://localhost:8983/solr
  Injector: Converting injected urls to crawl db entries.
  Exception in thread main java.io.IOException: No FileSystem for scheme:
  http
  at
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375
  )
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
  at
 org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
  at
  org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j
  ava:169)
  at
  org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja
  va:201)
  at
  org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
 
  at
  org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:7
  81)
  at
 org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
 
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
 
 



 --
 Kumar Anurag





Re: [Nutch] and Solr integration

2011-01-03 Thread Adam Estrada
BLEH! facepalm This is entirely possible to do in a single step AS LONG AS
YOU GET THE SYNTAX CORRECT ;-)

http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solr
http://localhost:8983/solr

The correct param is -solr NOT -solrindex.

Cheers,
Adam

On Mon, Jan 3, 2011 at 11:45 AM, Adam Estrada estrada.a...@gmail.comwrote:

 All,

 I realize that the documentation says that you crawl first then add to Solr
 but I spent several hours running the same command through Cygwin with
 -solrindex http://localhost:8983/solr on the command line (eg. bin/nutch
 crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex
 http://localhost:8983/solr) and it worked. Does anyone know why it's not
 working for me anymore? I am using the Lucid build of Solr which was what i
 was using before. I neglected to write down the command line syntax which is
 biting me in the arse. Any tips on this one would be great!

 Thanks,
 Adam

 On Mon, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote:


 Why are you using solrindex in the argument? It is used when we need to index
 the crawled data in Solr.
 For more read http://wiki.apache.org/nutch/NutchTutorial .

 Also for nutch-solr integration this is very useful blog
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
 I integrated nutch and solr and it works well.

 Thanks

 On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene] wrote:

  All,
 
  I have a couple websites that I need to crawl and the following command
  line
  used to work I think. Solr is up and running and everything is fine
 there
  and I can go through and index the site but I really need the results
 added
 
  to Solr after the crawl. Does anyone have any idea on how to make that
  happen or what I'm doing wrong?  These errors are being thrown from
 Hadoop
  which I am not using at all.
 
  $ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50
  -solrindex
  http://localhost:8983/solr
  crawl started in: crawl
  rootUrlDir = http://localhost:8983/solr
  threads = 10
  depth = 100
  indexer=lucene
  topN = 50
  Injector: starting at 2010-12-20 15:23:25
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: http://localhost:8983/solr
  Injector: Converting injected urls to crawl db entries.
  Exception in thread main java.io.IOException: No FileSystem for
 scheme:
  http
  at
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375
  )
  at
 org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
  at
 org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
  at
  org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j
  ava:169)
  at
  org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja
  va:201)
  at
  org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
 
  at
  org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:7
  81)
  at
 org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
 
  at
 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
 
 
  --
   View message @
 
 http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122347.html
  To start a new topic under Solr - User, email
  ml-node+472068-1941297125-146...@n3.nabble.comml-node%2b472068-1941297125-146...@n3.nabble.com
 ml-node%2b472068-1941297125-146...@n3.nabble.comml-node%252b472068-1941297125-146...@n3.nabble.com
 
  To unsubscribe from Solr - User, click here
 http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=472068code=YW51cmFnLml0LmpvbGx5QGdtYWlsLmNvbXw0NzIwNjh8LTIwOTgzNDQxOTY=
 .
 
 



 --
 Kumar Anurag







Re: [Nutch] and Solr integration

2010-12-20 Thread Anurag

Why are you using solrindex in the argument? It is used when we need to index
the crawled data in Solr.
For more read http://wiki.apache.org/nutch/NutchTutorial .

Also for nutch-solr integration this is very useful blog
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
I integrated nutch and solr and it works well.
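
In short, the two-step flow is roughly (a sketch; the Solr URL, depth and 
topN values are placeholders):

  # step 1: crawl; step 2: index the crawled segments into Solr
  bin/nutch crawl urls -dir crawl -depth 10 -topN 50
  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb \
    crawl/segments/*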

Thanks

On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene] wrote:

 All,

 I have a couple websites that I need to crawl and the following command
 line
 used to work I think. Solr is up and running and everything is fine there
 and I can go through and index the site but I really need the results added

 to Solr after the crawl. Does anyone have any idea on how to make that
 happen or what I'm doing wrong?  These errors are being thrown from Hadoop
 which I am not using at all.

 $ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50
 -solrindex
 http://localhost:8983/solr
 crawl started in: crawl
 rootUrlDir = http://localhost:8983/solr
 threads = 10
 depth = 100
 indexer=lucene
 topN = 50
 Injector: starting at 2010-12-20 15:23:25
 Injector: crawlDb: crawl/crawldb
 Injector: urlDir: http://localhost:8983/solr
 Injector: Converting injected urls to crawl db entries.
 Exception in thread main java.io.IOException: No FileSystem for scheme:
 http
 at
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375
 )
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j
 ava:169)
 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja
 va:201)
 at
 org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)

 at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:7
 81)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)

 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
 at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)







-- 
Kumar Anurag




Re: [Nutch] and Solr integration

2010-12-20 Thread Adam Estrada
bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex
http://localhost:8983/solr

I've run that command before and it worked...that's why I asked.

Grab nutch from trunk and run bin/nutch and see that it is in fact an
option. It looks like Hadoop is the culprit now and I am at a loss on how to
fix it.
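
That is, from the nutch directory (running the launcher with no arguments 
prints its usage and the list of available commands, so you can check 
whether solrindex is among them):

  bin/nutch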

Thanks for the feedback.
Adam

On Mon, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote:


 Why are you using solrindex in the argument? It is used when we need to index
 the crawled data in Solr.
 For more read http://wiki.apache.org/nutch/NutchTutorial .

 Also for nutch-solr integration this is very useful blog
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
 I integrated nutch and solr and it works well.

 Thanks

 On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene] wrote:

  All,
 
  I have a couple websites that I need to crawl and the following command
  line
  used to work I think. Solr is up and running and everything is fine there
  and I can go through and index the site but I really need the results
 added
 
  to Solr after the crawl. Does anyone have any idea on how to make that
  happen or what I'm doing wrong?  These errors are being thrown from Hadoop
  which I am not using at all.
 
  $ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50
  -solrindex
  http://localhost:8983/solr
  crawl started in: crawl
  rootUrlDir = http://localhost:8983/solr
  threads = 10
  depth = 100
  indexer=lucene
  topN = 50
  Injector: starting at 2010-12-20 15:23:25
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: http://localhost:8983/solr
  Injector: Converting injected urls to crawl db entries.
  Exception in thread main java.io.IOException: No FileSystem for scheme:
  http
  at
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375
  )
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
  at
 org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
  at
  org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j
  ava:169)
  at
  org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja
  va:201)
  at
  org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
 
  at
  org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:7
  81)
  at
 org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
 
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
 
 
 
 



 --
 Kumar Anurag





Re: Nutch with SOLR

2007-09-26 Thread Doğacan Güney
On 9/26/07, Brian Whitman [EMAIL PROTECTED] wrote:

  Sami has a patch in there which used an older version of the solr
  client. with the current solr client in the SVN tree, his patch
  becomes much easier.
  your job would be to upgrade the patch and mail it back to him so
  he can update his blog, or post it as a patch for inclusion in
  nutch/contrib (if sami is ok with that). If you have issues with
  how to use the solr client api, solr-user is here to help.
 

 I've done this. Apparently someone else has taken on the solr-nutch
 job and made it a bit more complicated (which is good for the long
 term) than sami's original patch -- https://issues.apache.org/jira/browse/NUTCH-442

That someone else is me :)

NUTCH-442 is one of the issues that I want to really see resolved.
Unfortunately, I haven't received many (as in, none) comments, so I
haven't made further progress on it.

The patch at NUTCH-442 tries to integrate SOLR in a way that makes it a
first-class citizen (so to speak), so that you can index to solr or
to lucene within the same Indexer job (or both), and retrieve search
results from a solr server or from nutch's home-grown index servers in
nutch's web UI (or a combination of both). And I think the patch lays the
groundwork for generating summaries from solr.


 But we still use a version of Sami's patch that works on both trunk
 nutch and trunk solr (solrj.) I sent my changes to sami when we did
 it, if you need it let me know...


 -b





-- 
Doğacan Güney


Re: Nutch with SOLR

2007-09-26 Thread Brian Whitman


On Sep 26, 2007, at 4:04 AM, Doğacan Güney wrote:


NUTCH-442 is one of the issues that I want to really see resolved.
Unfortunately, I haven't received many (as in, none) comments, so I
haven't made further progress on it.



I am probably your target customer but to be honest all we care about  
is using Solr to index, not for any of the searching or summary stuff  
in Nutch. Is there a way to get Sami's SolrIndexer in nutch trunk  
(now that it's working OK) sooner than later and keep working on  
NUTCH-442 as well? Do they conflict? -b





Re: Nutch with SOLR

2007-09-25 Thread Ian Holsman
[moving this thread to solr-user, as it really has nothing to do with 
hadoop]


Daniel Clark wrote:

There's info on website
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html,
but it's not clear.

  


Sami has a patch in there which used an older version of the solr client. 
With the current solr client in the SVN tree, his patch becomes much easier.
Your job would be to upgrade the patch and mail it back to him so he can 
update his blog, or post it as a patch for inclusion in nutch/contrib 
(if sami is ok with that). If you have issues with how to use the solr 
client api, solr-user is here to help.


the nutch specific changes are:
1. configure nutch-site.xml to add a config option to point to your solr 
server.


2. instead of calling the nutch 'index' command, you would call it like so
bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb 
$BASEDIR/linkdb $SEGMENT
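
For instance (a sketch with the variables filled in; the segment timestamp 
is a hypothetical example, use an actual directory under crawl/segments):

  # index one fetched and parsed segment into the Solr server configured above
  BASEDIR=crawl
  SEGMENT=$BASEDIR/segments/20070925120000
  bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT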



regards
Ian



~
Daniel Clark, President
DAC Systems, Inc.
(703) 403-0340
~

-Original Message-
From: Dmitry [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 25, 2007 2:56 PM

To: [EMAIL PROTECTED]
Subject: Re: Nutch with SOLR

Daniel,

We just started to test/research posibility of integration of Nutch and Solr

so it will be nice to hear any advices as well.

Thanks,
DT
www.ejizn.com

- Original Message - 
From: Daniel Clark [EMAIL PROTECTED]

To: [EMAIL PROTECTED]
Sent: Tuesday, September 25, 2007 1:23 PM
Subject: Nutch with SOLR


  
Has anyone been able to get Nutch 0.9 working with SOLR?  Any help would be
appreciated.



~

Daniel Clark, President

DAC Systems, Inc.

(703) 403-0340

~









  




Re: Nutch with SOLR

2007-09-25 Thread Brian Whitman


Sami has a patch in there which used an older version of the solr  
client. with the current solr client in the SVN tree, his patch  
becomes much easier.
your job would be to upgrade the patch and mail it back to him so  
he can update his blog, or post it as a patch for inclusion in  
nutch/contrib (if sami is ok with that). If you have issues with  
how to use the solr client api, solr-user is here to help.




I've done this. Apparently someone else has taken on the solr-nutch  
job and made it a bit more complicated (which is good for the long  
term) than sami's original patch -- https://issues.apache.org/jira/browse/NUTCH-442


But we still use a version of Sami's patch that works on both trunk  
nutch and trunk solr (solrj.) I sent my changes to sami when we did  
it, if you need it let me know...



-b




Re: Nutch with SOLR

2007-09-25 Thread Brian Whitman


But we still use a version of Sami's patch that works on both trunk  
nutch and trunk solr (solrj.) I sent my changes to sami when we did  
it, if you need it let me know...




I put my files up here: http://variogr.am/latest/?p=26

-b



Re: Nutch with SOLR

2007-09-25 Thread Ian Holsman

Thanks Brian.
I'm sure this will help lots of people.

Brian Whitman wrote:


But we still use a version of Sami's patch that works on both trunk 
nutch and trunk solr (solrj.) I sent my changes to sami when we did 
it, if you need it let me know...




I put my files up here: http://variogr.am/latest/?p=26

-b