Re: question about ObjectCache

2012-04-10 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

2012-03-02 Thread Andrzej Bialecki
shine some light on what happened to Fetcher2.java that Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0) Fetcher2 is the current Fetcher. The original Fetcher was temporarily renamed OldFetcher and then removed. -- Best regards, Andrzej Bialecki

[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-17 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187927#comment-13187927 ] Andrzej Bialecki commented on NUTCH-1201: -- I agree that there are situations

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-14 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186212#comment-13186212 ] Andrzej Bialecki commented on NUTCH-1247: -- Indeed, line 264 increases the retry

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-13 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185908#comment-13185908 ] Andrzej Bialecki commented on NUTCH-1247: -- Originally the reason for a byte
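As general background for the byte-versus-int question in this thread, here is a minimal Java sketch of what happens when a counter stored in a byte is incremented past its maximum. The class and method names are hypothetical and are not taken from CrawlDatum; the truncated comment above does not show the original reasoning, so this only illustrates the language behaviour.

```java
// Hypothetical illustration only; RetryCounterDemo is not part of Nutch.
class RetryCounterDemo {

    // Increment a counter held in a byte. The cast back to byte means the
    // value silently wraps around once it passes Byte.MAX_VALUE (127).
    static byte incrementByte(byte retries) {
        return (byte) (retries + 1);
    }

    // The same increment on an int keeps counting well past 127.
    static int incrementInt(int retries) {
        return retries + 1;
    }

    public static void main(String[] args) {
        byte atLimit = Byte.MAX_VALUE;
        System.out.println("byte counter after increment: " + incrementByte(atLimit));
        System.out.println("int counter after increment:  " + incrementInt(atLimit));
    }
}
```

Whether the wider type is worth the extra bytes per record is a trade-off the thread itself discusses; the sketch takes no position on it.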

Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki
/ + <dependency org="junit" name="junit" rev="3.8.1" conf="*->default"/> <dependency org="org.apache.hadoop" name="hadoop-test" rev="0.20.205.0" conf="test->default"/> -- Best regards, Andrzej Bialecki

Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki
on the classpath and masks the issue? this happened to me once or twice... -- Best regards, Andrzej Bialecki

Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-15 Thread Andrzej Bialecki
the other ones are easy to convert, too... I'm bogged with other work now, but I'll see if I can prepare an example later today... -- Best regards, Andrzej Bialecki

Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki
is in org.apache.hadoop.mapreduce.lib.output . -- Best regards, Andrzej Bialecki

Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki
org.apache.hadoop.mapred.MapFileOutputFormat still uses the old api, and it's not deprecated yet. -- Best regards, Andrzej Bialecki

Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki
drop in the 0.22 jars and see if it compiles / tests are passing. -- Best regards, Andrzej Bialecki

Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki
regards, Andrzej Bialecki

[jira] [Resolved] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-28 Thread Andrzej Bialecki (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-1213. -- Resolution: Fixed Committed in rev. 1207217, thanks for the review

[jira] [Updated] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-25 Thread Andrzej Bialecki (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-1213: - Attachment: NUTCH-1213.diff Patch that implements this functionality. SolrParams can

[jira] [Issue Comment Edited] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-25 Thread Andrzej Bialecki (Issue Comment Edited) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157077#comment-13157077 ] Andrzej Bialecki edited comment on NUTCH-1213 at 11/25/11 10:26 AM

Re: Dependency Injection

2011-11-23 Thread Andrzej Bialecki
On 23/11/2011 01:02, Andrzej Bialecki wrote: On 22/11/2011 19:47, PJ Herring wrote: Hey Chris, Thanks for the response. I looked at the documents you sent me, and I really do think incorporating some kind of DI Framework could be a great addition to Nutch. I have a general plan of attack

Re: Dependency Injection

2011-11-22 Thread Andrzej Bialecki
supposed to run ... so at that time we didn't think this complication was justified. If we can figure out something between full-blown OSGI and the current system then that would be great. -- Best regards, Andrzej Bialecki

Re: Signature == null ?

2011-11-15 Thread Andrzej Bialecki
in CrawlDbReducer... Do you notice any pattern to these pages? What's their origin? -- Best regards, Andrzej Bialecki

Re: Persistent problems with Ivy dependencies in Eclipse

2011-11-10 Thread Andrzej Bialecki
porting of a pure ant build to an ant+ivy build. We should determine what deps are really needed by these plugins, and sanitize the ivy.xml files so that they make sense - if the existing files can't be untangled we can ditch them and come up with new, clean ones. -- Best regards, Andrzej

[jira] [Commented] (NUTCH-1139) Indexer to delete documents

2011-11-10 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147722#comment-13147722 ] Andrzej Bialecki commented on NUTCH-1139: -- I suggest renaming the option

[jira] [Commented] (NUTCH-1061) Migrate MoreIndexingFilter from Apache ORO to java.util.regex

2011-11-10 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147723#comment-13147723 ] Andrzej Bialecki commented on NUTCH-1061: -- +1. Migrate

Re: Nutch Maven artifacts now published as polled/nightly SNAPSHOTS

2011-11-05 Thread Andrzej Bialecki
, Andrzej Bialecki

[jira] [Commented] (NUTCH-1196) Update job should impose an upper limit on the number of inlinks (nutchgora)

2011-11-04 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144226#comment-13144226 ] Andrzej Bialecki commented on NUTCH-1196: -- Very nicely done and useful patch

[jira] [Resolved] (NUTCH-1195) Add Solr 4x (trunk) example schema

2011-11-03 Thread Andrzej Bialecki (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-1195. -- Resolution: Fixed Committed in rev. 1197319. Add Solr 4x (trunk

[jira] [Created] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml

2011-11-03 Thread Andrzej Bialecki (Created) (JIRA)
Components: indexer Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.4 In some cases it's useful to be able to add to every document sent to Solr a set of predefined fields with static values. This could be implemented on the Solr

[jira] [Updated] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml

2011-11-03 Thread Andrzej Bialecki (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-1197: - Attachment: NUTCH-1197.patch Patch with the implementation. I added some javadocs

[jira] [Created] (NUTCH-1195) Add Solr 4x (trunk) example schema

2011-11-02 Thread Andrzej Bialecki (Created) (JIRA)
Bialecki Assignee: Andrzej Bialecki Fix For: 1.4 The conf/schema.xml that we ship works ok for Solr 3.x, but in Solr trunk some of the class names have been changed, and some field types have been redefined, so if you simply drop this schema into Solr it will cause

[jira] [Updated] (NUTCH-1195) Add Solr 4x (trunk) example schema

2011-11-02 Thread Andrzej Bialecki (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-1195: - Attachment: schema-solr4.xml Add Solr 4x (trunk) example schema

[jira] [Commented] (NUTCH-1135) Fix TestGoraStorage for Nutchgora

2011-10-14 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127427#comment-13127427 ] Andrzej Bialecki commented on NUTCH-1135: -- A few comments from the author

[jira] [Commented] (NUTCH-1135) Fix TestGoraStorage for Nutchgora

2011-10-14 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127470#comment-13127470 ] Andrzej Bialecki commented on NUTCH-1135: -- bq. if you prefer to keep the old

[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125712#comment-13125712 ] Andrzej Bialecki commented on NUTCH-797: - That's unexpected :) I checked the patch

Re: [jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-12 Thread Andrzej Bialecki
, Andrzej Bialecki

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125916#comment-13125916 ] Andrzej Bialecki commented on NUTCH-1097: -- +1, the latest patch looks good

[jira] [Commented] (NUTCH-1142) Normalization and filtering in WebGraph

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125931#comment-13125931 ] Andrzej Bialecki commented on NUTCH-1142: -- +1, the patch looks good

[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124737#comment-13124737 ] Andrzej Bialecki commented on NUTCH-797: - The fixup code in Tika is still

[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125016#comment-13125016 ] Andrzej Bialecki commented on NUTCH-797: - Uhh, sorry - I'll fix this in a moment

[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125077#comment-13125077 ] Andrzej Bialecki commented on NUTCH-797: - I'm puzzled by the algorithm

[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-797: Attachment: NUTCH-797.patch Tentative patch, which changes the meaning of fixEmbeddedParams

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125414#comment-13125414 ] Andrzej Bialecki commented on NUTCH-1097: -- +1 the idea makes sense. Patch looks

[jira] [Commented] (NUTCH-1154) Upgrade to Tika 0.10

2011-10-10 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124428#comment-13124428 ] Andrzej Bialecki commented on NUTCH-1154: -- TIKA-748 has been fixed

[jira] [Created] (NUTCH-1152) Upgrade to SolrJ 3.4.0

2011-10-07 Thread Andrzej Bialecki (Created) (JIRA)
Upgrade to SolrJ 3.4.0 -- Key: NUTCH-1152 URL: https://issues.apache.org/jira/browse/NUTCH-1152 Project: Nutch Issue Type: Improvement Reporter: Andrzej Bialecki Fix For: 1.4 Current release

[jira] [Resolved] (NUTCH-1152) Upgrade to SolrJ 3.4.0

2011-10-07 Thread Andrzej Bialecki (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-1152. -- Resolution: Fixed Assignee: Andrzej Bialecki Committed in rev. 1180087

[jira] [Created] (NUTCH-1154) Upgrade to Tika 0.10

2011-10-07 Thread Andrzej Bialecki (Created) (JIRA)
Upgrade to Tika 0.10 Key: NUTCH-1154 URL: https://issues.apache.org/jira/browse/NUTCH-1154 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.4 Reporter: Andrzej

[jira] [Updated] (NUTCH-1154) Upgrade to Tika 0.10

2011-10-07 Thread Andrzej Bialecki (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-1154: - Attachment: NUTCH-1154.diff Patch to upgrade to Tika 0.10. Unfortunately, TestRTFParser

[jira] [Commented] (NUTCH-1124) JUnit test for scoring-opic

2011-10-05 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120982#comment-13120982 ] Andrzej Bialecki commented on NUTCH-1124: -- Our implementation is most definitely

Re: [VOTE] Move 2.0 out of trunk

2011-09-20 Thread Andrzej Bialecki
for a usable platform, and continue redesign from that codebase. -- Best regards, Andrzej Bialecki

[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2011-08-23 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089405#comment-13089405 ] Andrzej Bialecki commented on NUTCH-1087: -- IIRC we had this discussion

[jira] [Commented] (NUTCH-1014) Migrate from Apache ORO to java.util.regex

2011-07-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067972#comment-13067972 ] Andrzej Bialecki commented on NUTCH-1014: -- java.util.regex has the advantage
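For readers unfamiliar with the target API of this migration, the following is a rough sketch of typical java.util.regex usage. It is not code from the NUTCH-1014 patch; the class name, the pattern, and the charset-extraction task are all assumptions chosen only to show the Pattern/Matcher idiom.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from any Nutch plugin.
class RegexMigrationSketch {

    // A Pattern is compiled once and reused. Pattern instances are immutable
    // and safe to share between threads; Matcher instances are not.
    private static final Pattern CHARSET =
            Pattern.compile("charset=([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or null if absent.
    static String extractCharset(String contentType) {
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractCharset("text/html; charset=UTF-8"));
        System.out.println(extractCharset("text/html"));
    }
}
```

Because the classes ship with the JDK, this style needs no extra library on the classpath.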

[jira] [Commented] (NUTCH-985) MoreIndexingFilter doesn't use properly formatted date fields for Solr

2011-05-17 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034724#comment-13034724 ] Andrzej Bialecki commented on NUTCH-985: - We should use Solr's DateUtil in all

Re: Differences 1.x and trunk

2011-03-18 Thread Andrzej Bialecki
to specify what fields from WebPage you are interested in (and only these fields will be pulled in from the storage). This is all handled by StorageUtils methods. -- Best regards, Andrzej Bialecki

Re: [jira] Closed: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-10 Thread Andrzej Bialecki
improvements from 2.0 have been backported into 1.3 now The only remaining issue to address before rolling out a 1.3 release is NUTCH-914 Implement Apache Project Branding Requirements (and subtasks...) -- Best regards, Andrzej Bialecki

[jira] Resolved: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-951. - Resolution: Fixed Backport changes from 2.0 into 1.3

[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004488#comment-13004488 ] Andrzej Bialecki commented on NUTCH-951: - * Ported NUTCH-872 in rev. 1079746

[jira] Resolved: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects

2011-03-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-962. - Resolution: Fixed Fix Version/s: 2.0 1.3 Assignee

[jira] Resolved: (NUTCH-955) Ivy configuration

2011-03-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-955. - Resolution: Fixed Fix Version/s: 2.0 Assignee: Andrzej Bialecki Ivy

Gora/HBase dependencies and deploy artifacts

2011-02-24 Thread Andrzej Bialecki
how to deploy Gora backend implementations so that they work with Nutch and with a generic unmodified Hadoop cluster. -- Best regards, Andrzej Bialecki

[jira] Resolved: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-12-21 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-939. - Resolution: Fixed Assignee: Andrzej Bialecki I modified the patch slightly

[jira] Created: (NUTCH-948) Remove Lucene dependencies

2010-12-21 Thread Andrzej Bialecki (JIRA)
Remove Lucene dependencies -- Key: NUTCH-948 URL: https://issues.apache.org/jira/browse/NUTCH-948 Project: Nutch Issue Type: Improvement Affects Versions: 1.3 Reporter: Andrzej Bialecki

[jira] Resolved: (NUTCH-948) Remove Lucene dependencies

2010-12-21 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-948. - Resolution: Fixed Committed in rev. 1051509. Remove Lucene dependencies

[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-12-21 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973915#action_12973915 ] Andrzej Bialecki commented on NUTCH-939: - 1.2 release is out, and branch-1.2

Re: Does Nutch 2.0 in good enough shape to test?

2010-12-17 Thread Andrzej Bialecki
bug. - wait for Hadoop job completion in the Fetcher job I missed your previous email... I'll fix this shortly - thanks for spotting it. -- Best regards, Andrzej Bialecki

Re: Java.io.IOException with multiple <copyField/> directives

2010-12-03 Thread Andrzej Bialecki
that the field type declared in your schema.xml is not multiValued. What was the exception? -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-11-26 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936047#action_12936047 ] Andrzej Bialecki commented on NUTCH-939: - Please note that trunk uses a very

[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-25 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932-4.patch Final version of the patch. Bulk REST API to retrieve crawl

[jira] Resolved: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-25 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-932. - Resolution: Fixed Fix Version/s: 2.0 Committed in rev. 1039014. Bulk REST API

[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-12 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932-3.patch NutchTool is an abstract class in this patch. This actually

[jira] Commented: (NUTCH-880) REST API for Nutch

2010-11-05 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928909#action_12928909 ] Andrzej Bialecki commented on NUTCH-880: - Thanks - this issue is already fixed

[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932.patch This patch adds bulk retrieval of crawl results. This is still

[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: db.formatted.gz Example DB content (this was passed through a JSON pretty

[jira] Commented: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928355#action_12928355 ] Andrzej Bialecki commented on NUTCH-932: - Examples (with the db equivalent

[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932.patch Updated patch - this recognizes now URL parameters

[jira] Resolved: (NUTCH-931) Simple admin API to fetch status and stop the service

2010-10-29 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-931. - Resolution: Fixed Committed in rev. 1028736 with some changes. Simple admin API

[jira] Updated: (NUTCH-880) REST API for Nutch

2010-10-28 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Summary: REST API for Nutch (was: REST API (and webapp) for Nutch) The webapp part

[jira] Resolved: (NUTCH-880) REST API for Nutch

2010-10-28 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-880. - Resolution: Fixed Fix Version/s: 2.0 Committed in rev. 1028235. The webapp part

[jira] Created: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)
Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Nutch doesn't use Lucene API anymore, all indexing happens via Lucene-agnostic SolrJ API. The only place where we still use a minor part of Lucene is in index-basic, and that use (DateTools) can be easily replaced. -- This message
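One way such a replacement could look with JDK classes alone is sketched below. This is not the NUTCH-930 patch: the class name, the method, and the choice of a sortable `yyyyMMddHHmmssSSS` UTC string are assumptions made for illustration, and the format actually required by index-basic should be checked against the plugin.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical illustration only; not the code committed for NUTCH-930.
class PlainDateFormatSketch {

    // Formats an epoch-millisecond timestamp as a fixed-width UTC string
    // that sorts lexicographically in time order (for non-negative values).
    // A new SimpleDateFormat is created per call because the class is not
    // thread-safe.
    static String toSortableUtc(long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmssSSS", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(toSortableUtc(0L));
        System.out.println(toSortableUtc(System.currentTimeMillis()));
    }
}
```

Pinning the time zone and locale explicitly keeps the output independent of the machine the job runs on, which matters when the strings are compared across nodes.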

[jira] Updated: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-930: Attachment: NUTCH-930.patch Patch to fix the issue. I'll commit this shortly. Remove

[jira] Resolved: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-930. - Resolution: Fixed Fix Version/s: 2.0 Committed in rev. 1028474. Remove remaining

[jira] Created: (NUTCH-931) Simple admin API to fetch status and stop the service

2010-10-28 Thread Andrzej Bialecki (JIRA)
: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 REST API needs a simple info / stats service and the ability to shutdown the server. -- This message is automatically generated by JIRA. - You can reply

[jira] Commented: (NUTCH-926) Nutch follows wrong url in META http-equiv=refresh tag

2010-10-27 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925543#action_12925543 ] Andrzej Bialecki commented on NUTCH-926: - bq. Nutch continues to crawl the WRONG

Re: ReviewBoard Instance

2010-10-26 Thread Andrzej Bialecki
... but what's the point of using the tool in our JIRA-based workflow? It looks to me like it duplicates at least part of JIRA's functionality, and the remaining part is what we do also in JIRA by convention... -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora

2010-10-25 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924659#action_12924659 ] Andrzej Bialecki commented on NUTCH-913: - +1, let's commit it - I want to start

[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-23 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924154#action_12924154 ] Andrzej Bialecki commented on NUTCH-923: - This doesn't solve the problem

[jira] Commented: (NUTCH-924) Static field in solr mapping

2010-10-22 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923845#action_12923845 ] Andrzej Bialecki commented on NUTCH-924: - The functionality is useful, +1

[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-22 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923896#action_12923896 ] Andrzej Bialecki commented on NUTCH-923: - This sounds useful, though

Re: Build failed in Hudson: Nutch-trunk #1280

2010-10-19 Thread Andrzej Bialecki
regards, Andrzej Bialecki

[jira] Updated: (NUTCH-921) Reduce dependency of Nutch on config files

2010-10-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-921: Attachment: NUTCH-921.patch Patch that implements reading config parameters from

[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora

2010-10-13 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920610#action_12920610 ] Andrzej Bialecki commented on NUTCH-913: - There are formatting issues

[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-01 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870 ] Andrzej Bialecki commented on NUTCH-907: - Hi Sertan, Thanks for the patch

[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-10-01 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916874#action_12916874 ] Andrzej Bialecki commented on NUTCH-882: - Doğacan, I missed your previous comment

[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0

2010-10-01 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916912#action_12916912 ] Andrzej Bialecki commented on NUTCH-864: - I think the difficulty comes from

Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki
tomorrow. -- Best regards, Andrzej Bialecki

Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki
+ indexing to Solr went just fine. -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-21 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913118#action_12913118 ] Andrzej Bialecki commented on NUTCH-880: - bq. I think we can combine the approach

[jira] Commented: (NUTCH-909) Add alternative search-provider to Nutch site

2010-09-20 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912474#action_12912474 ] Andrzej Bialecki commented on NUTCH-909: - bq. It might be better to see the message

[jira] Assigned: (NUTCH-862) HttpClient null pointer exception

2010-09-17 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-862: --- Assignee: Andrzej Bialecki HttpClient null pointer exception

[jira] Resolved: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names

2010-09-17 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-906. - Fix Version/s: 1.2 Resolution: Fixed Fixed in rev. 998261. Thanks! Nutch

[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-16 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910109#action_12910109 ] Andrzej Bialecki commented on NUTCH-907: - That's very good news - in that case I'm

[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-16 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Attachment: API.patch Initial patch for discussion. This is a work in progress, so only

[jira] Created: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-15 Thread Andrzej Bialecki (JIRA)
Issue Type: Bug Reporter: Andrzej Bialecki Fix For: 2.0 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc) by specifying a path where the data was stored. This enabled users to run several disjoint crawls

[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-09-15 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909757#action_12909757 ] Andrzej Bialecki commented on NUTCH-882: - +1 to NutchContext. See also NUTCH-907
