[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-04-18 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13256423#comment-13256423 ] Julien Nioche commented on NUTCH-1314: -- I was under the impression that the patch did

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-04-18 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13256437#comment-13256437 ] Julien Nioche commented on NUTCH-1314: -- This makes a good case for the merging of URL

[jira] [Commented] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

2012-04-18 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13256531#comment-13256531 ] Julien Nioche commented on NUTCH-1297: -- Hi Ferdy Indeed, it is related but does not

[jira] [Commented] (NUTCH-1331) limit crawler to defined depth

2012-04-11 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13251684#comment-13251684 ] Julien Nioche commented on NUTCH-1331: -- This can be done with the ScoringFilters

[jira] [Commented] (NUTCH-1234) Upgrade to Tika 1.1

2012-04-02 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13244143#comment-13244143 ] Julien Nioche commented on NUTCH-1234: -- Markus - you need to update the list of

[jira] [Commented] (NUTCH-1234) Upgrade to Tika 1.1

2012-03-31 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243096#comment-13243096 ] Julien Nioche commented on NUTCH-1234: -- Sure, will have a look at it next week

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-29 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241085#comment-13241085 ] Julien Nioche commented on NUTCH-1024: -- Hi Markus Will have a closer look later. 2

[jira] [Commented] (NUTCH-809) Parse-metatags plugin

2012-03-22 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235468#comment-13235468 ] Julien Nioche commented on NUTCH-809: - Hi Lewis bq. Can you confirm what you would

[jira] [Commented] (NUTCH-809) Parse-metatags plugin

2012-03-21 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234316#comment-13234316 ] Julien Nioche commented on NUTCH-809: - Trunk : Committed revision 1303371. Not

[jira] [Commented] (NUTCH-1310) Nutch to send HTTP-accept header

2012-03-14 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13229290#comment-13229290 ] Julien Nioche commented on NUTCH-1310: -- code location - same as property

[jira] [Commented] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

2012-03-04 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221908#comment-13221908 ] Julien Nioche commented on NUTCH-1297: -- This can already be addressed by giving a

[jira] [Commented] (NUTCH-1293) IndexingFiltersChecker to store detected content type in crawldatum metadata

2012-03-01 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13220087#comment-13220087 ] Julien Nioche commented on NUTCH-1293: -- wrong patch?

[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-03-01 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13220101#comment-13220101 ] Julien Nioche commented on NUTCH-1258: -- Weird. Yes, please do fix and commit if you

[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

2012-02-21 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212502#comment-13212502 ] Julien Nioche commented on NUTCH-1281: -- Behnam, I suppose that you are seeing this

[jira] [Commented] (NUTCH-1079) StringBuffer converted to StringBuilder

2012-02-18 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210870#comment-13210870 ] Julien Nioche commented on NUTCH-1079: -- Don't rely on me for this one. I am not in

[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0

2012-02-18 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210871#comment-13210871 ] Julien Nioche commented on NUTCH-1246: -- open as not done in nutchgora AFAIK

[jira] [Commented] (NUTCH-1259) Store detected content type in crawldatum metadata

2012-02-14 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207724#comment-13207724 ] Julien Nioche commented on NUTCH-1259: -- good catch. Had overlooked the fact that the

[jira] [Commented] (NUTCH-1259) Store detected content type in crawldatum metadata

2012-02-14 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207753#comment-13207753 ] Julien Nioche commented on NUTCH-1259: -- bq. But what about segments fetched with and

[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-10 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205490#comment-13205490 ] Julien Nioche commented on NUTCH-1259: -- I haven't looked at NUTCH-1024. Does it take

[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-10 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205670#comment-13205670 ] Julien Nioche commented on NUTCH-1259: -- Nah, might as well do it in this one. Will

[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-09 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204463#comment-13204463 ] Julien Nioche commented on NUTCH-1259: -- bq. // DO NOT ADD Content-Type FROM

[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-07 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202474#comment-13202474 ] Julien Nioche commented on NUTCH-1259: -- bq. I'll commit this one tomorrow unless

[jira] [Commented] (NUTCH-1264) Configurable indexing plugin (index-extra)

2012-02-06 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13201378#comment-13201378 ] Julien Nioche commented on NUTCH-1264: -- Attached a second version which does not

[jira] [Commented] (NUTCH-1005) Index headings plugin

2012-02-01 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197853#comment-13197853 ] Julien Nioche commented on NUTCH-1005: -- Markus, the parser should store the MD in

[jira] [Commented] (NUTCH-1005) Index headings plugin

2012-02-01 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197888#comment-13197888 ] Julien Nioche commented on NUTCH-1005: -- bq. I assume i have to disable the indexing

[jira] [Commented] (NUTCH-1005) Index headings plugin

2012-02-01 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197890#comment-13197890 ] Julien Nioche commented on NUTCH-1005: -- BTW if you can think of a better name for

[jira] [Commented] (NUTCH-1005) Index headings plugin

2012-02-01 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197898#comment-13197898 ] Julien Nioche commented on NUTCH-1005: -- bq. index-meta comes to mind! It's exactly

[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type

2012-01-31 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196829#comment-13196829 ] Julien Nioche commented on NUTCH-1262: -- Just wondering, does not Tika's Mimetype

[jira] [Commented] (NUTCH-1242) Allow disabling of URL Filters in ParseSegment

2012-01-31 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196950#comment-13196950 ] Julien Nioche commented on NUTCH-1242: -- Shouldn't this test for noFilter and

[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-01-25 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13192997#comment-13192997 ] Julien Nioche commented on NUTCH-1258: -- What about using a similar mechanism for the

[jira] [Commented] (NUTCH-1254) NTLMv2 is not supported and HttpClient returns error code 500

2012-01-18 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188492#comment-13188492 ] Julien Nioche commented on NUTCH-1254: -- Should be done as part of

[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0

2012-01-12 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185033#comment-13185033 ] Julien Nioche commented on NUTCH-1246: -- trunk : Committed revision 1230610

[jira] [Commented] (NUTCH-1244) CrawlDBDumper to filter by regex

2012-01-09 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182575#comment-13182575 ] Julien Nioche commented on NUTCH-1244: -- Not tested but looks Ok, compiles and passes

[jira] [Commented] (NUTCH-1244) CrawlDBDumper to filter by regex

2012-01-05 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13180434#comment-13180434 ] Julien Nioche commented on NUTCH-1244: -- duplicates

[jira] [Commented] (NUTCH-1244) CrawlDBDumper to filter by regex

2012-01-05 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13180458#comment-13180458 ] Julien Nioche commented on NUTCH-1244: -- yep. Would be good to add an optional filter

[jira] [Commented] (NUTCH-1241) CrawlDBScanner should also be able to find records

2012-01-04 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13179382#comment-13179382 ] Julien Nioche commented on NUTCH-1241: -- Entering '.+/product/.*' is not that

[jira] [Commented] (NUTCH-1241) CrawlDBScanner should also be able to find records

2012-01-04 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13179501#comment-13179501 ] Julien Nioche commented on NUTCH-1241: -- bq. However, due to NUTCH-1029 i cannot test

[jira] [Commented] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-12-19 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172397#comment-13172397 ] Julien Nioche commented on NUTCH-1184: -- Just managed to have a look and haven't seen

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-06 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163481#comment-13163481 ] Julien Nioche commented on NUTCH-1047: -- bq. If you'd need WARC files, for some

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-06 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163704#comment-13163704 ] Julien Nioche commented on NUTCH-1047: -- The class NutchIndexWriter and

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-05 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162704#comment-13162704 ] Julien Nioche commented on NUTCH-1047: -- It would be nice to have a plugin

[jira] [Commented] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-28 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13158429#comment-13158429 ] Julien Nioche commented on NUTCH-1213: -- Looks fine to me, feel free to go ahead and

[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT

2011-11-24 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13156645#comment-13156645 ] Julien Nioche commented on NUTCH-1205: -- Why upgrading the sub-dependencies such as

[jira] [Commented] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-15 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150453#comment-13150453 ] Julien Nioche commented on NUTCH-1184: -- Markus, can you hold it until 1.4 is

[jira] [Commented] (NUTCH-1200) Resolving Ivy dependencies in several plugins

2011-11-11 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13148393#comment-13148393 ] Julien Nioche commented on NUTCH-1200: -- I'm definitely against the idea of putting

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-11-03 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143038#comment-13143038 ] Julien Nioche commented on NUTCH-1098: -- @Radim Sounds like I am not going to is your

[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]

2011-11-01 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141267#comment-13141267 ] Julien Nioche commented on NUTCH-1188: -- +1 to commit. See corresponding class in

[jira] [Commented] (NUTCH-882) Design a Host table in GORA

2011-10-31 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13140008#comment-13140008 ] Julien Nioche commented on NUTCH-882: - nope, go ahead

[jira] [Commented] (NUTCH-1185) Decrease solr.commit.size

2011-10-31 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13140047#comment-13140047 ] Julien Nioche commented on NUTCH-1185: -- Or we could catch the exceptions (OOME or

[jira] [Commented] (NUTCH-672) allow unit tests to be run from bin/nutch

2011-09-29 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13117213#comment-13117213 ] Julien Nioche commented on NUTCH-672: - You are welcome. Looks fine to me +1 to commit

[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-09-28 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116308#comment-13116308 ] Julien Nioche commented on NUTCH-1078: -- I had modified LogUtil in 2.0 (see