[jira] [Commented] (NUTCH-1306) Commit after finished writing to solr index

2012-04-03 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245798#comment-13245798
 ] 

Lewis John McGibbney commented on NUTCH-1306:
-

Hi Dan. In trunk, we have a number of nice features which I would like to bring 
to your attention. Maybe you can comment on whether you would like to see some 
of them go into Nutchgora?

Namely, NUTCH-1185, NUTCH-1000, NUTCH-996, NUTCH-991 and NUTCH-799

wdyt?

 Commit after finished writing to solr index
 ---

 Key: NUTCH-1306
 URL: https://issues.apache.org/jira/browse/NUTCH-1306
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Trivial
 Fix For: nutchgora

 Attachments: NUTCH-1306.patch


 Commit after finished writing to solr index - otherwise a bit confusing not 
 seeing the number of docs we expect in solr

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-29 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241207#comment-13241207
 ] 

Lewis John McGibbney commented on NUTCH-1024:
-

I like this, Markus. I need to be honest and say that I've not had time 
to give it a spin recently, so apologies for that. It looks like the process 
to date has been a bit frustrating, so I apologize for not chipping in earlier. 
Anyway, we don't rely on commons for logging; could you please replace this with
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
{code}

One further point from me:

You make reference to the following configuration properties
{code}
SCHEDULE_INC_RATE = "db.fetch.schedule.adaptive.inc_rate";
SCHEDULE_DEC_RATE = "db.fetch.schedule.adaptive.dec_rate";
SCHEDULE_MIME_FILE = "db.fetch.schedule.mime.file";
{code}

I don't see the new MIME_FILE property added in the patch, and I also don't 
see the INC and DEC properties added to nutch-default.xml.
Thanks


 Dynamically set fetchInterval by MIME-type
 --

 Key: NUTCH-1024
 URL: https://issues.apache.org/jira/browse/NUTCH-1024
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: AdaptiveFetchSchedule.patch, 
 MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, 
 NUTCH-1024-1.5-2.patch, Nutch.patch, adaptive-mimetypes.txt


 Add facility to configure default or fixed fetchInterval values by MIME-type. 
 This is useful for conserving resources for files that are known to change 
 frequently or never and everything in between.
 * simple key\tvalue\n configuration file
 * only set fetchInterval for new documents
 * keep max fetchInterval fixed by current config
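
 The three points above could be sketched roughly as follows. This is only an
 illustration, not the code in the attached patches: the class name, the method
 names and the interval values are hypothetical, and intervals are assumed to be
 whole seconds.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not taken from the attached patches.
class MimeIntervalTable {
  private final Map<String, Integer> intervals = new HashMap<>();
  private final int defaultInterval; // seconds, used when the MIME type is not listed
  private final int maxInterval;     // seconds, upper bound taken from the existing config

  // fileContents holds "mime-type<TAB>seconds" lines separated by newlines.
  MimeIntervalTable(String fileContents, int defaultInterval, int maxInterval) {
    this.defaultInterval = defaultInterval;
    this.maxInterval = maxInterval;
    for (String line : fileContents.split("\\r?\\n")) {
      String[] parts = line.split("\t");
      if (parts.length != 2) {
        continue; // skip blank or malformed lines
      }
      try {
        intervals.put(parts[0].trim(), Integer.parseInt(parts[1].trim()));
      } catch (NumberFormatException e) {
        // skip lines whose value is not an integer
      }
    }
  }

  // Interval to assign to a newly discovered document, capped at the maximum.
  int intervalForNewDocument(String mimeType) {
    int interval = intervals.getOrDefault(mimeType, defaultInterval);
    return Math.min(interval, maxInterval);
  }
}
```

 Documents that already carry a fetchInterval would simply not be passed through
 this lookup, which is how the "only for new documents" point is assumed to work
 here.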


[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-03-29 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241212#comment-13241212
 ] 

Lewis John McGibbney commented on NUTCH-1320:
-

Nice Markus. +1. Is there scope for this to be applied elsewhere, or is 
parsechecker the only instance (so far) where you've encountered the problem?

 IndexChecker and ParseChecker choke on IDN's
 

 Key: NUTCH-1320
 URL: https://issues.apache.org/jira/browse/NUTCH-1320
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1320-1.5-1.patch


 These handy debug tools do not handle IDN's and throw an NPE
 bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
 {code}
 Exception in thread "main" java.lang.NullPointerException
 at 
 org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at 
 org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
 {code}
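
 One possible way to make such a URL digestible is to convert the host to its
 ASCII (Punycode) form before further processing. The sketch below uses only
 the JDK's java.net.IDN; it is not necessarily what NUTCH-1320-1.5-1.patch does,
 and the class and method names are hypothetical.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Nutch.
class IdnUrls {
  // Returns the URL with its host converted to ASCII (Punycode).
  // The path and query are left untouched.
  static String toAsciiHost(String url) {
    try {
      URL u = new URL(url);
      String asciiHost = IDN.toASCII(u.getHost());
      String port = u.getPort() == -1 ? "" : ":" + u.getPort();
      return u.getProtocol() + "://" + asciiHost + port + u.getFile();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Not a valid URL: " + url, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(toAsciiHost("http://例子.測試/%E9%A6%96%E9%A0%81"));
  }
}
```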


[jira] [Commented] (NUTCH-366) Merge URLFilters and URLNormalizers

2012-03-21 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234349#comment-13234349
]

Lewis John McGibbney commented on NUTCH-366:

Hi Apurv, this is great news :)
I suggest that, if you have not already done so, you take a look at NUTCH-365
and try to put the material Andrzej mentioned into context. In parallel, I would
look at the way the current URLFilters and URLNormalizers are constructed with
regard to option 1 in the issue description. It would be great to get this
moving as a GSoC project.

Merge URLFilters and URLNormalizers
---

Key: NUTCH-366
URL: https://issues.apache.org/jira/browse/NUTCH-366
Project: Nutch
Issue Type: Improvement
Reporter: Andrzej Bialecki
Labels: gsoc2012

Currently Nutch uses two subsystems related to url validation and
normalization:
* URLFilter: this interface checks if URLs are valid for further processing.
Input URL is not changed in any way. The output is a boolean value.
* URLNormalizer: this interface brings URLs to their base (normal) form, or
removes unneeded URL components, or performs any other URL mangling as
necessary. Input URLs are changed, and are returned as result.
However, various Nutch tools run filters and normalizers in pre-determined
order, i.e. normalizers first, and then filters. In some cases, where
normalizers are complex and running them is costly (e.g. numerous regex
rules, DNS lookups) it would make sense to run some of the filters first
(e.g. prefix-based filters that select only certain protocols, or
suffix-based filters that select only known extensions). This is currently
not possible - we always have to run normalizers, only to later throw away
urls because they failed to pass through filters.
I would like to solicit comments on the following two solutions, and work on
implementation of one of them:
1) we could make URLFilters and URLNormalizers implement the same interface,
and basically make them interchangeable. This way users could configure their
order arbitrarily, even mixing filters and normalizers out of order. This is
more complicated, but gives much more flexibility - and NUTCH-365 already
provides sufficient framework to implement this, including the ability to
define different sequences for different steps in the workflow.
2) we could use a property url.mangling.order ;) to define whether
normalizers or filters should run first. This is simple to implement, but
provides only limited improvement - because either all filters or all
normalizers would run, they couldn't be mixed in arbitrary order.
Any comments?
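
To make option 1 a little more concrete, here is a minimal sketch of a shared
interface. It is an assumption for discussion only: UrlProcessor and UrlPipeline
are hypothetical names, and returning null to signal rejection is just one
possible convention, not the existing Nutch API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical common interface for filters and normalizers:
// returns the (possibly rewritten) URL, or null to reject it.
interface UrlProcessor {
  String process(String url);
}

// Runs processors in whatever order the user configured.
class UrlPipeline {
  private final List<UrlProcessor> steps;

  UrlPipeline(UrlProcessor... steps) {
    this.steps = Arrays.asList(steps);
  }

  String run(String url) {
    String current = url;
    for (UrlProcessor step : steps) {
      current = step.process(current);
      if (current == null) {
        return null; // rejected: the remaining (possibly costly) steps are skipped
      }
    }
    return current;
  }

  public static void main(String[] args) {
    UrlProcessor httpOnly = u -> u.startsWith("http://") ? u : null; // cheap prefix filter
    UrlProcessor lowerCase = u -> u.toLowerCase(Locale.ROOT);        // stand-in for a costly normalizer
    UrlPipeline pipeline = new UrlPipeline(httpOnly, lowerCase);
    System.out.println(pipeline.run("http://Example.COM/"));
    System.out.println(pipeline.run("ftp://example.com/"));
  }
}
```

With the cheap filter placed first, the stand-in normalizer never runs for the
rejected ftp URL, which is the saving the description is after.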


[jira] [Commented] (NUTCH-809) Parse-metatags plugin

2012-03-21 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235108#comment-13235108
 ] 

Lewis John McGibbney commented on NUTCH-809:


Hi Julien,

Can you confirm what you would like to see added to the wiki? I will try my 
best to get this added. Are you referring to [0]? Also, I thought the best 
thing to do regarding porting to Nutchgora was just to add it to the ever 
growing NUTCH-1104 list, so I have done so. If and when this is required over 
there someone can duly oblige :)
Regarding adding fields to Solr, I assume you mean the schema and solr-mapping.xml?
Finally, can you expand on 'activate by default'? What exactly is it that is not 
activated by default? I read your README.txt but I can't see any mention of it in 
there.
Thanks

Oh and great patch, this is one which as we know is very much appreciated by 
everyone. 
[0] http://wiki.apache.org/nutch/IndexStructure

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4, nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.5

 Attachments: NUTCH-809-trunk.patch, NUTCH-809.patch, 
 NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip


 h2. Parse-metatags plugin
 The parse-metatags plugin consists of a HTMLParserFilter which takes as 
 parameter a list of metatag names with '*' as default value. The values are 
 separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
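
 As a rough sketch of how such a ';'-separated value might be interpreted, with
 '*' meaning "all metatags": the helper below is hypothetical and is not the
 plugin's own code, and lower-casing the names is an assumption made here for
 the example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not the parse-metatags implementation.
class MetatagNames {
  // Splits a value such as "description;keywords" into trimmed, lower-cased names.
  static Set<String> parse(String value) {
    Set<String> names = new LinkedHashSet<>();
    for (String part : value.split(";")) {
      String name = part.trim().toLowerCase(Locale.ROOT);
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    return names;
  }

  // A tag is extracted if it is listed explicitly or the wildcard is present.
  static boolean accepts(Set<String> names, String tag) {
    return names.contains("*") || names.contains(tag.toLowerCase(Locale.ROOT));
  }
}
```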
 The MetatagIndexer uses the output of the parsing above to create two fields 
 'keywords' and 'description'. Note that keywords is multivalued.
 The query-basic plugin is used to include these fields in the search e.g. in 
 nutch-site.xml
 {code:xml}
 <property>
   <name>query.basic.description.boost</name>
   <value>2.0</value>
 </property>
 <property>
   <name>query.basic.keywords.boost</name>
   <value>2.0</value>
 </property>
 {code}
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com


[jira] [Commented] (NUTCH-1317) Max content length by MIME-type

2012-03-20 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233828#comment-13233828
 ] 

Lewis John McGibbney commented on NUTCH-1317:
-

Do you have any indication as to why this is, Markus? Which plugin are you using 
to parse your HTML?

 Max content length by MIME-type
 ---

 Key: NUTCH-1317
 URL: https://issues.apache.org/jira/browse/NUTCH-1317
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 The good old http.content.length directive is not sufficient in large 
 internet crawls. For example, a 5MB PDF file may be parsed without issues but 
 a 5MB HTML file may time out.


[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2012-03-19 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232588#comment-13232588
]

Lewis John McGibbney commented on NUTCH-978:

Great Ammar. Are you wanting to add this as a GSoC2012 project? I am already
mentoring one project, and time/work restrictions mean that I can't step up to
take on another mentoring role. If you don't wish to make this a project this
year, at least the code is on here for guys to pick it up in the future.

[GSoC 2011] A Plugin for extracting certain element of a web page on html
page parsing.
---

Key: NUTCH-978
URL: https://issues.apache.org/jira/browse/NUTCH-978
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.2
Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
Labels: gsoc2011, mentor
Fix For: nutchgora

Attachments:
[Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf,
app_guardian_ivory_coast_news_exmpl.png,
app_screenshoot_configuration_result.png,
app_screenshoot_configuration_result_anchor.png,
app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png,
for_GSoc.zip, version_alpha2.zip

Original Estimate: 1,680h
Remaining Estimate: 1,680h

Nutch uses the parse-html plugin to parse web pages. It processes the contents
of a web page by removing HTML tags and components such as JavaScript and CSS,
leaving the extracted text to be stored in the index. By default, Nutch cannot
select individual elements of an HTML page, such as particular tags, particular
content or a specific part of the page.
An HTML page has a tree-like XML structure, with HTML tags as its branches and
text as its nodes. These branches and nodes can be extracted using XPath.
XPath allows us to select a particular branch or node of an XML document, so it
can be used to extract specific information and treat it differently depending
on its content and the user's requirements. Furthermore, a web domain such as a
news website usually uses the same HTML structure across its pages, so the same
XPath query can be applied to each page to retrieve the same content element.
All of the XPath queries for selecting the various content elements could be
stored in an XPath configuration file.
Nutch is intended to crawl a variety of web sources, and pages retrieved from
different sources do not share the same HTML structure, so each must be treated
with the correct XPath configuration. The correct configuration could be
selected automatically by matching the URL of the web page against a regex
describing the valid URL pattern for that XPath configuration.
This automatic mechanism allows Nutch users to process a variety of web pages
and keep only the information they want, making the index more accurate and its
content more flexible.
The components for this idea have been tested on Nutch 1.2 for selecting
certain elements on various news websites for the purpose of document
clustering. This includes a Configuration Editor application built using the
NetBeans 6.9 Application Framework, though it still needs some debugging.
http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
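
The mechanism described above (a URL regex selects an XPath query, which then
pulls one element out of the page) might be sketched as below. This is not the
code in the attached archives: the class is hypothetical, it uses only the JDK's
javax.xml.xpath, and it assumes the page is already well-formed XML, which real
HTML usually is not without a tidying step first.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical sketch, not the plugin from the attachments.
class XPathExtractor {
  // URL regex -> XPath expression; the first matching rule wins.
  private final Map<Pattern, String> rules = new LinkedHashMap<>();

  void addRule(String urlRegex, String xpath) {
    rules.put(Pattern.compile(urlRegex), xpath);
  }

  // Returns the text selected by the first rule whose regex matches the URL,
  // or null if no rule applies to this URL.
  String extract(String url, String wellFormedXhtml) {
    for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
      if (rule.getKey().matcher(url).matches()) {
        try {
          XPath xpath = XPathFactory.newInstance().newXPath();
          return xpath.evaluate(rule.getValue(),
              new InputSource(new StringReader(wellFormedXhtml)));
        } catch (XPathExpressionException e) {
          throw new IllegalArgumentException("Bad XPath or unparsable page", e);
        }
      }
    }
    return null;
  }
}
```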

[jira] [Commented] (NUTCH-1315) reduce speculation on but ParseOutputFormat doesn't name output files correctly?

2012-03-19 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232836#comment-13232836
 ] 

Lewis John McGibbney commented on NUTCH-1315:
-

Regarding your comment, i.e. whether Nutch turns on reduce speculation, my 
initial thought is no. I will try to confirm/iron out. Do you have any 
speculation settings configured for Hadoop at all?

 reduce speculation on but ParseOutputFormat doesn't name output files 
 correctly?
 

 Key: NUTCH-1315
 URL: https://issues.apache.org/jira/browse/NUTCH-1315
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: ubuntu 64bit, hadoop 1.0.1, 3 Node Cluster, segment size 
 1.5M urls
Reporter: Rafael
  Labels: hadoop, hdfs

 From time to time the Reducer log contains the following and one tasktracker 
 gets blacklisted.
 org.apache.hadoop.ipc.RemoteException: 
 org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to 
 create file 
 /user/test/crawl/segments/20120316065507/parse_text/part-1/data for 
 DFSClient_attempt_201203151054_0028_r_01_1 on client xx.x.xx.xx.10, 
 because this file is already being created by 
 DFSClient_attempt_201203151054_0028_r_01_0 on xx.xx.xx.9
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1404)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1244)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1186)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
   at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
   at org.apache.hadoop.ipc.Client.call(Client.java:1066)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
   at $Proxy2.create(Unknown Source)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
   at $Proxy2.create(Unknown Source)
   at 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3245)
   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
   at 
 org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1132)
   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
   at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:157)
   at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:134)
   at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:92)
   at 
 org.apache.nutch.parse.ParseOutputFormat.getRecordWriter(ParseOutputFormat.java:110)
   at 
 org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:448)
   at 
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
   at org.apache.hadoop.mapred.Child.main(Child.java:249)
 I asked the hdfs-user mailing list and i got the following answer:
 Looks

[jira] [Commented] (NUTCH-1273) Fix [deprecation] javac warnings

2012-03-18 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232309#comment-13232309
 ] 

Lewis John McGibbney commented on NUTCH-1273:
-

Still some work to be done with trunk. Rolling back changes with Nutchgora as 
I've broken it :( I'll try to pick this up again soon.

 Fix [deprecation] javac warnings
 

 Key: NUTCH-1273
 URL: https://issues.apache.org/jira/browse/NUTCH-1273
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-1273-nutchgora.patch, NUTCH-1273-trunk.patch, 
 NUTCH-1273-v2-trunk.patch


 As part of this task, these warnings should be resolved, however this 
 particular strand of warnings can either be resolved by adding
 {code}
 @SuppressWarnings("deprecation")
 {code}
 or by actually upgrading our class usage to rely upon non-deprecated classes. 
 Which option is more appropriate for the project?
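
 To make the two options concrete, here is a small side-by-side sketch. It uses
 java.util.Date.getYear() purely as a stand-in for a deprecated API; the classes
 that actually trigger the warnings in Nutch are not shown in this message, and
 the class name below is hypothetical.

```java
import java.util.Calendar;
import java.util.Date;

// Hypothetical illustration of the two options, not Nutch code.
class DeprecationOptions {
  // Option 1: keep the deprecated call and silence the warning for this method only.
  @SuppressWarnings("deprecation")
  static int yearSuppressed(Date date) {
    return date.getYear() + 1900; // Date.getYear() is deprecated
  }

  // Option 2: migrate to the non-deprecated replacement.
  static int yearUpgraded(Date date) {
    Calendar calendar = Calendar.getInstance();
    calendar.setTime(date);
    return calendar.get(Calendar.YEAR);
  }
}
```

 Option 1 is quicker but leaves the old dependency in place; option 2 removes
 the warning at its source.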


[jira] [Commented] (NUTCH-1310) Nutch to send HTTP-accept header

2012-03-15 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230400#comment-13230400
 ] 

Lewis John McGibbney commented on NUTCH-1310:
-

Looks good to me Markus. +1

 Nutch to send HTTP-accept header
 

 Key: NUTCH-1310
 URL: https://issues.apache.org/jira/browse/NUTCH-1310
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1310-1.5-1.patch


 Nutch does not send an HTTP Accept header with its requests. This is usually 
 not a problem, but some firewalls do not like it and will reject the request.
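
 For illustration only, this is what adding such a header looks like with the
 JDK's HttpURLConnection. Nutch's protocol plugins build their requests in their
 own way, and the header value below is an assumption rather than the value used
 in the attached patch.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical example, not the Nutch protocol plugin code.
class AcceptHeaderExample {
  // Assumed value; the patch may use something different.
  static final String ACCEPT =
      "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";

  // Prepares (but does not open) a connection that will send an Accept header.
  static HttpURLConnection prepare(String url) {
    try {
      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
      conn.setRequestProperty("Accept", ACCEPT);
      return conn;
    } catch (IOException e) {
      throw new IllegalArgumentException("Cannot prepare request for " + url, e);
    }
  }
}
```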


[jira] [Commented] (NUTCH-882) Design a Host table in GORA

2012-03-12 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227963#comment-13227963
 ] 

Lewis John McGibbney commented on NUTCH-882:


Mathijs, my opinion is that you have a clean sheet of paper to begin with 
certain aspects of this one (simply because you've stepped up to take it on). 
You obviously have your own idea about how you would like to see the new host 
table design and also have justification behind the eventual implementation 
(and API break/redesign) of NutchContext. I think it's wise NOT to break the 
plugin API at this stage, and an incremental approach to addressing this one 
is a suitable strategy. Feel free to open another issue for NutchContext, as 
quite rightly this appears to have now morphed into its own subdomain of the 
umbrella issue. 

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: nutchgora

 Attachments: NUTCH-882-v1.patch, hostdb.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for : 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table but any comments 
 are of course already welcome 


[jira] [Commented] (NUTCH-1304) GeneratorMapper.java dosen't return when skipping and already generated mark

2012-03-08 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225159#comment-13225159
 ] 

Lewis John McGibbney commented on NUTCH-1304:
-

+1 for commit. I'll wait until this afternoon to hear back from anyone else 
before doing so. Thanks Dan.

 GeneratorMapper.java dosen't return when skipping and already generated mark
 

 Key: NUTCH-1304
 URL: https://issues.apache.org/jira/browse/NUTCH-1304
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1304.patch


 GeneratorMapper.java doesn't return when skipping an already generated mark


[jira] [Commented] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries

2012-03-08 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225206#comment-13225206
 ] 

Lewis John McGibbney commented on NUTCH-1305:
-

+1

 Domain(blacklist)URLFilter to trim entries
 --

 Key: NUTCH-1305
 URL: https://issues.apache.org/jira/browse/NUTCH-1305
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1305-1.5-1.patch


 Both filters should handle entries with trailing whitespace.
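
 A minimal sketch of the idea, assuming one entry per line: each line is trimmed
 before it is stored, so "example.com" followed by spaces or tabs still matches.
 The class is hypothetical and is not the code in NUTCH-1305-1.5-1.patch;
 skipping '#' comment lines is likewise an assumption.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not the filter implementation from the patch.
class DomainListParser {
  // Returns the entries of a domain list with surrounding whitespace removed.
  static Set<String> parse(String fileContents) {
    Set<String> domains = new LinkedHashSet<>();
    for (String line : fileContents.split("\\r?\\n")) {
      String entry = line.trim(); // drops leading and trailing spaces and tabs
      if (!entry.isEmpty() && !entry.startsWith("#")) {
        domains.add(entry);
      }
    }
    return domains;
  }
}
```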


[jira] [Commented] (NUTCH-1304) GeneratorMapper.java dosen't return when skipping and already generated mark

2012-03-08 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225270#comment-13225270
 ] 

Lewis John McGibbney commented on NUTCH-1304:
-

Please close this one off when you have time, Dan. Thank you.

 GeneratorMapper.java dosen't return when skipping and already generated mark
 

 Key: NUTCH-1304
 URL: https://issues.apache.org/jira/browse/NUTCH-1304
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1304.patch


 GeneratorMapper.java doesn't return when skipping an already generated mark


[jira] [Commented] (NUTCH-728) Improve nutch release packaging

2012-03-08 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225389#comment-13225389
 ] 

Lewis John McGibbney commented on NUTCH-728:


Looking at this, then at what we have available on our mirrors, I don't really 
see the need at the moment (unless it would make the release process easier) for 
including this code. Chris already provides us with src.tar.gz with every 
release?
I suppose this one's really down to the release manager's opinion. 

 Improve nutch release packaging
 ---

 Key: NUTCH-728
 URL: https://issues.apache.org/jira/browse/NUTCH-728
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
 Attachments: NUTCH-728-nutchgora.patch, NUTCH-728-v2.patch, 
 NUTCH-728.patch


 see the discussion from 
 http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact


[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions

2012-03-06 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223504#comment-13223504
 ] 

Lewis John McGibbney commented on NUTCH-1253:
-

Hi Ferdy, the patches I attached were identical for branch Nutchgora and trunk. 
I would have assumed if trunk was incorrect then Nutchgora would have shadowed 
this behaviour. 

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
 Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch


 The Nutch 1.4 distribution includes
  - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
  - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
 These two JARs appear to be incompatible versions. When the HtmlParser 
 (configured to use neko) is invoked during a local-mode crawl, the parse 
 fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, 
 rebuild the HtmlParser plugin and add a
 catch(Throwable) clause in the getParse method to log the stacktrace.)
 I found that substituting a later, compatible version of nekohtml (1.9.11)
 fixes the problem.
 Curiously, and in support of the above, the nekohtml plugin.xml file in
 Nutch 1.4 contains the following:
 <plugin
    id="lib-nekohtml"
    name="CyberNeko HTML Parser"
    version="1.9.11"
    provider-name="org.cyberneko">
    <runtime>
       <library name="nekohtml-0.9.5.jar">
          <export name="*"/>
       </library>
    </runtime>
 </plugin>
 Note the conflicting version numbers (version tag is 1.9.11 but the
 specified library is nekohtml-0.9.5.jar).
 Was the 0.9.5 version included by mistake? Was the intention rather to
 include 1.9.11?


[jira] [Commented] (NUTCH-945) Indexing to multiple SOLR Servers

2012-03-04 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221898#comment-13221898
 ] 

Lewis John McGibbney commented on NUTCH-945:


On user@, Julien passed some excellent comments on this one [0]. My opinion is 
that I would like to see these incorporated; admittedly I've not checked the 
patch out, Sujit (so please excuse me if these points are addressed). My 
justification behind this is simply longevity. Markus stated 

{bq}If Solr 4.0 is released in the coming months (and that's what it looks 
like) i 
would suggest to patch Nutch to allow for a list of Solr server URL's instead 
of doing partitioning on the client site.
{bq}

Which I agree with, however until we witness a Solr 4.0 release (currently 
sitting @ 348 issues [2]) I don't see why this can't be integrated into 
Nutchgora.


[0] http://www.mail-archive.com/user@nutch.apache.org/msg05664.html
[1] http://www.mail-archive.com/user@nutch.apache.org/msg05674.html
[2] 
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+SOLR+AND+resolution+%3D+Unresolved+AND+fixVersion+%3D+%224.0%22+ORDER+BY+priority+DESC&mode=hide

 Indexing to multiple SOLR Servers
 -

 Key: NUTCH-945
 URL: https://issues.apache.org/jira/browse/NUTCH-945
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Charan Malemarpuram
 Attachments: MurmurHashPartitioner.java, 
 NonPartitioningPartitioner.java, patch-NUTCH-945.txt


 It would be nice to have a default Indexer in Nutch which can submit docs to 
 multiple SOLR Servers.
  Partitioning is always the question when writing to multiple SOLR Servers.
  Default partitioning can be a simple hashcode-based distribution with 
  additional hooks for customization.
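
A minimal sketch of the hashcode-based distribution described above, assuming a fixed list of Solr server URLs. The class name HashServerPartitioner and the example URLs are made up for illustration; they are not taken from the attached patch or from Nutch itself.

```java
import java.util.List;

/** Hypothetical helper: picks a Solr server for a document by hashing its id. */
final class HashServerPartitioner {
    private final List<String> serverUrls;

    HashServerPartitioner(List<String> serverUrls) {
        if (serverUrls.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        this.serverUrls = List.copyOf(serverUrls);
    }

    /** The same document id always maps to the same server. */
    String serverFor(String docId) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(docId.hashCode(), serverUrls.size());
        return serverUrls.get(index);
    }

    public static void main(String[] args) {
        // Example URLs are placeholders (assumption).
        HashServerPartitioner partitioner = new HashServerPartitioner(
            List.of("http://solr1:8983/solr", "http://solr2:8983/solr"));
        System.out.println(partitioner.serverFor("http://example.com/a"));
    }
}
```

A customization hook could be as small as an interface with the serverFor method, so that a different strategy can be swapped in without touching the indexer.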
  


[jira] [Commented] (NUTCH-1291) Fetcher to stringify exception on // unexpected exception

2012-02-29 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219222#comment-13219222
 ] 

Lewis John McGibbney commented on NUTCH-1291:
-

It's a plus 1 from me mate :)

 Fetcher to stringify exception on // unexpected exception
 -

 Key: NUTCH-1291
 URL: https://issues.apache.org/jira/browse/NUTCH-1291
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.5

 Attachments: NUTCH-1291-1.5-1.patch


 During development we sometimes saw a less than helpful exception e.g. fetch 
 of http://www.openindex.io/en/home.html failed with: 
 java.lang.NullPointerException. This error must be a bit more descriptive.
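
To illustrate what a more descriptive message could look like, here is a small JDK-only sketch that turns a Throwable into its full stack trace text. The class and method names (ExceptionText, stringify) are hypothetical and the attached patch may well use a different helper.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical helper for logging: renders a Throwable as text. */
final class ExceptionText {
    /** Returns the exception class, message, stack frames and causes as one String. */
    static String stringify(Throwable t) {
        StringWriter buffer = new StringWriter();
        try (PrintWriter out = new PrintWriter(buffer)) {
            t.printStackTrace(out);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        try {
            Object nothing = null;
            nothing.toString(); // provokes a NullPointerException on purpose
        } catch (Throwable t) {
            System.out.println("fetch of http://www.openindex.io/en/home.html failed with: "
                + stringify(t));
        }
    }
}
```

With the stack trace in the log line it becomes possible to see where the NullPointerException was thrown, not just that one occurred.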


[jira] [Commented] (NUTCH-670) feed plugin does not parse RSS2 enclosures

2012-02-28 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218098#comment-13218098
 ] 

Lewis John McGibbney commented on NUTCH-670:


Sure is. Not to worry. Thanks

 feed plugin does not parse RSS2 enclosures
 --

 Key: NUTCH-670
 URL: https://issues.apache.org/jira/browse/NUTCH-670
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 The feed parser in plugins/feed does not count links found in RSS2 
 enclosure tags as Outlinks.
 It's a pretty simple patch - SyndEntry has a getEnclosures call. I'll submit 
 the patch tomorrow.


[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

2012-02-27 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217155#comment-13217155
 ] 

Lewis John McGibbney commented on NUTCH-1289:
-

Hi Dan, thanks for opening this issue and for the patch. Are you using trunk at 
all? If so is it possible to confirm if this functionality is already running 
in trunk... if not then we can get a patch cooked up.

 In distributed mode URL's are not partitioned
 -

 Key: NUTCH-1289
 URL: https://issues.apache.org/jira/browse/NUTCH-1289
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: nutchgora
Reporter: Dan Rosher
 Fix For: nutchgora

 Attachments: NUTCH-1289.patch


 In distributed mode URL's are not partitioned to a specific machine which 
 means the politeness policy is voided
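
The idea behind host-based partitioning can be sketched as follows: if the partition is derived from the host name only, every URL of a given host lands on the same machine, so per-host politeness delays can be enforced in one place. HostPartitioner and partitionFor are hypothetical names for illustration and do not describe the attached patch.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical sketch: assigns URLs to partitions by host name. */
final class HostPartitioner {
    /** Maps a URL to one of numPartitions buckets using only its host. */
    static int partitionFor(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        // Fall back to the whole URL when no host can be extracted.
        String key = (host == null) ? url : host.toLowerCase(Locale.ROOT);
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Both URLs share a host, so they get the same partition number.
        System.out.println(partitionFor("http://example.com/a", 4));
        System.out.println(partitionFor("http://example.com/b?x=1", 4));
    }
}
```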


[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

2012-02-27 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217159#comment-13217159
 ] 

Lewis John McGibbney commented on NUTCH-1289:
-

Markus, what is your opinion as to which suits best? Or is it the case in 
Nutchgora that Dan's patch is more appropriate?

 In distributed mode URL's are not partitioned
 -

 Key: NUTCH-1289
 URL: https://issues.apache.org/jira/browse/NUTCH-1289
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: nutchgora
Reporter: Dan Rosher
 Fix For: nutchgora

 Attachments: NUTCH-1289.patch


 In distributed mode URL's are not partitioned to a specific machine which 
 means the politeness policy is voided


[jira] [Commented] (NUTCH-1286) Refactoring/reimplementing crawling API (NutchApp)

2012-02-26 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216710#comment-13216710
]

Lewis John McGibbney commented on NUTCH-1286:
-

For reference, a brief description from Marko regarding the UI which was
designed here [0].

+ a new extension point that describes a UI component
+ a UI component is a plugin that uses backend classes from Nutch to provide
functionality (e.g. inject, fetch, configuration or whatever)
+ a UI component can be deployed to a webserver as a new webapp
+ an application that starts a webserver (e.g. Jetty) and deploys all
implemented UI components to the webserver

The goal was to use the plugin API to develop UI components separately so that
they can be deployed to the webserver as a new context.

+ every UI component can have more than one instance
+ with this approach we were able to create different types of crawls (e.g. fast
crawl, long running crawl ...)
+ every type has one instance of a UI component

+ an important UI component we implemented was a component to configure the
Configuration object
+ with that you can configure your crawl instance with different plugins or
different configurations for a fetcher or whatever

Our UI components were using the Nutch backend directly.

It would be nice to compile a diff list describing changes between
implementations.

[0] https://github.com/101tec/nutch

Refactoring/reimplementing crawling API (NutchApp)
--

Key: NUTCH-1286
URL: https://issues.apache.org/jira/browse/NUTCH-1286
Project: Nutch
Issue Type: Improvement
Components: administration gui, REST_api, web gui
Reporter: Ferdy Galema

This issue is to track changes we (Mathijs and I) have planned for the API
and webapp in Nutchgora. We have a pretty good idea of how we want to be
using the crawl API. It may involve some major refactoring or perhaps a side
implementation next to the current NutchApp functionality. It depends on how
much we can reuse the existing components. The bottom line is that there will
be a strictly defined Java API that provides everything from
crawling/indexing to job control (listing jobs, tracking progress and
aborting jobs being part of it). There will be no server or service for
tracking crawling states; all will be persisted one way or the other and
queryable from the API. The REST server shall be a very thin layer on top of
the Java implementation. A rich web interface will be a very easy layer too,
once we have a cleanly (but extensively) defined API. But we will start by
making the API usable from a simple command-line interface.
More details will be provided later on. Feel free to comment if you have
suggestions/questions.

[jira] [Commented] (NUTCH-728) Improve nutch release packaging

2012-02-26 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216750#comment-13216750
 ] 

Lewis John McGibbney commented on NUTCH-728:


Ok to commit?

 Improve nutch release packaging
 ---

 Key: NUTCH-728
 URL: https://issues.apache.org/jira/browse/NUTCH-728
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
 Attachments: NUTCH-728-nutchgora.patch, NUTCH-728-v2.patch, 
 NUTCH-728.patch


 see the discussion from 
 http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact


[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions

2012-02-26 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216751#comment-13216751
 ] 

Lewis John McGibbney commented on NUTCH-1253:
-

Anyone had time to try this one out?

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
 Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch


 The Nutch 1.4 distribution includes
  - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
  - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
 These two JARs appear to be incompatible versions. When the HtmlParser 
 (configured to use neko) is invoked during a local-mode crawl, the parse 
 fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, 
 rebuild the HtmlParser plugin and add a
 catch(Throwable) clause in the getParse method to log the stacktrace.)
 I found that substituting a later, compatible version of nekohtml (1.9.11)
 fixes the problem.
 Curiously, and in support of the above, the nekohtml plugin.xml file in
 Nutch 1.4 contains the following:
 <plugin
    id="lib-nekohtml"
    name="CyberNeko HTML Parser"
    version="1.9.11"
    provider-name="org.cyberneko">
    <runtime>
       <library name="nekohtml-0.9.5.jar">
          <export name="*"/>
       </library>
    </runtime>
 </plugin>
 Note the conflicting version numbers (version tag is 1.9.11 but the
 specified library is nekohtml-0.9.5.jar).
 Was the 0.9.5 version included by mistake? Was the intention rather to
 include 1.9.11?


[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

2012-02-23 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214598#comment-13214598
 ] 

Lewis John McGibbney commented on NUTCH-965:


Yeah this is confirmed Ferdy. I spun a build and you're right. Another headache 
to deal with :) Relentless!

 Skip parsing for truncated documents
 

 Key: NUTCH-965
 URL: https://issues.apache.org/jira/browse/NUTCH-965
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Alexis
Assignee: Lewis John McGibbney
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, 
 NUTCH-965-v3-trunk.txt, parserJob.patch


 The issue you're likely to run into when parsing truncated FLV files is 
 described here:
 http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
 The parser library gets stuck in an infinite loop as it encounters corrupted 
 data caused by, for example, truncating big binary files at fetch time.
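
One way such a check could work is to compare the length declared by the server with the number of bytes actually stored, and skip parsing when the stored content is shorter. The sketch below is an assumption about the approach, not a copy of the attached patches; TruncationCheck and isTruncated are hypothetical names.

```java
/** Hypothetical sketch: detects content cut short at fetch time. */
final class TruncationCheck {
    /**
     * Returns true when the declared Content-Length is larger than the bytes
     * actually stored. A missing or unparsable header yields false, because
     * nothing can be concluded in that case.
     */
    static boolean isTruncated(String contentLengthHeader, byte[] content) {
        if (contentLengthHeader == null) {
            return false;
        }
        try {
            long declared = Long.parseLong(contentLengthHeader.trim());
            return declared > content.length;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] stored = new byte[65536]; // e.g. the fetcher's content limit
        System.out.println(isTruncated("10485760", stored)); // true: 10 MB declared
        System.out.println(isTruncated("65536", stored));    // false: complete
    }
}
```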


[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2012-02-21 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212576#comment-13212576
]

Lewis John McGibbney commented on NUTCH-978:

No bother Chris. So far questions that have been asked
1. provide a quick run down on the issue, summarizing all of the above
2. what were the motivations, purpose and technical challenges encountered
whilst working on it?
3. Why did the issue drop away and what do you think is required to get it back
on track and possibly in the codebase?

[GSoC 2011] A Plugin for extracting certain element of a web page on html
page parsing.
---

Original Estimate: 1,680h
Remaining Estimate: 1,680h

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2012-02-21 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212582#comment-13212582
]

Lewis John McGibbney commented on NUTCH-978:

Replies:

1 & 2. The main motivation for this issue is processing news documents,
which is required for my undergrad thesis on Bahasa Indonesia news text
clustering. It needs a preprocessing step to extract the title, news
content, date and related news links separately.

2. The biggest technical challenge for me is processing the web page
so that it can be parsed as an XML document and queried with
XPath.

3. The issue dropped away because with a small tweak I could get it
working for my thesis requirements only. I haven't tested it with
web pages other than those I used for my thesis, so I think it's
nowhere near finished yet. And since the proposal was not accepted
as a GSoC project, I lost motivation to continue working on this issue
and decided to work on my thesis instead.

related issue : https://issues.apache.org/jira/browse/NUTCH-185
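
For readers unfamiliar with the approach in the replies above, this is roughly what evaluating an XPath expression against a page looks like with the JDK's own javax.xml.xpath API. It is only a sketch: it assumes the page is already well-formed XML (real-world HTML would first need cleaning by a tolerant parser), and the XPathExtract class and the sample markup are invented for illustration, not taken from the plugin.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical sketch: extracts one element of a well-formed page via XPath. */
final class XPathExtract {
    /** Parses the XML and returns the string value of the first match. */
    static String extract(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            // Malformed input or a bad expression ends up here.
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>News</title></head>"
            + "<body><div id='content'>Body text</div></body></html>";
        System.out.println(extract(page, "/html/head/title"));
        System.out.println(extract(page, "//div[@id='content']"));
    }
}
```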

[GSoC 2011] A Plugin for extracting certain element of a web page on html
page parsing.
---

Original Estimate: 1,680h
Remaining Estimate: 1,680h

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2012-02-21 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212584#comment-13212584
]

Lewis John McGibbney commented on NUTCH-978:

Generally speaking the plugin sounds really useful. The only problem I
see is that it is very specific. For it to be integrated into the code base
we usually need to make it specific enough to address some given task fully and
in a well defined and well justified manner, but we also need to make it
general enough to be used in many different contexts. This increases usability
and user feedback as well as engagement.

4. With regards to the biggest technical challenge being the processing of web
pages, how far did you get with this? Were you able to process them with enough
precision to satisfy your requirements?

5. How were you querying it with XPath? You cannot query with XPath, but
instead with XQuery. Do you maybe mean that this enabled you to navigate the
document and address various parts of it using XPath?

6. OK, I understand why it has crumbled slightly, but I think if the code is
there it would be a huge waste if we didn't try to revive it, possibly getting
it integrated into the code base, or maybe getting it added as a contrib
component without shipping it within the core codebase if the former is not a
viable option.

I've had a look at NUTCH-185, but I think we can discard this as it was
addressed a very long time ago and is already integrated into the codebase.
I was referring more to Jira issues which are currently open, which we could
maybe merge or combine to give this a more general and possibly more justified
argument for inclusion in the codebase... what do you think? Does NUTCH-585
fit this?

[GSoC 2011] A Plugin for extracting certain element of a web page on html
page parsing.
---

Original Estimate: 1,680h
Remaining Estimate: 1,680h


[jira] [Commented] (NUTCH-1001) bin/nutch fetch/parse handle crawl/segments directory

2012-02-20 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211779#comment-13211779
 ] 

Lewis John McGibbney commented on NUTCH-1001:
-

Great :0)

 bin/nutch fetch/parse handle crawl/segments directory
 -

 Key: NUTCH-1001
 URL: https://issues.apache.org/jira/browse/NUTCH-1001
 Project: Nutch
  Issue Type: Improvement
Reporter: Gabriele Kahlout
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1001.patch


 I'm having issues porting scripts across different systems to support the 
 step of extracting the latest/only segments resulting from the generate phase.
 Variants include:
 $ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1]
 $ s1=`ls -d crawl/segments/2* | tail -1` #[2]
 $ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o 
 [a-zA-Z0-9/\-]* |tail -1`
 $ segment=`$HADOOP_HOME/bin/hdfs -ls crawl/segments | tail -1 | grep -o 
 [a-zA-Z0-9/\-]* |tail -1`
 And I'm not sure what windows users would have to do. Some users may also do 
 with:
 bin/nutch fetch with crawl/segments/2*
 But I don't see a need for having the user extract/worry about the latest/only 
 segment, or for having it as a described step in every Nutch tutorial. Moreover, 
 only fetch and parse expect a segment, while other commands are fine with the 
 directory of segments.
 Therefore, I think it's beneficial if fetch and parse also handle directories 
 of segments. 
 [1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
 [2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching
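
A portable alternative to the shell variants above is to resolve the latest segment in Java. The sketch below assumes a local filesystem (not HDFS) and segment directories named by timestamp, so that the name that sorts last is the newest. LatestSegment and its methods are hypothetical and are not part of the attached patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: finds the newest segment under a segments directory. */
final class LatestSegment {
    /** Returns the name that sorts last, or empty when there are no names. */
    static Optional<String> latestName(Collection<String> names) {
        return names.stream().max(Comparator.naturalOrder());
    }

    /** Lists the sub-directories of segmentsDir and returns the newest one. */
    static Optional<Path> latest(Path segmentsDir) {
        try (Stream<Path> children = Files.list(segmentsDir)) {
            List<String> names = children
                .filter(p -> Files.isDirectory(p))
                .map(p -> p.getFileName().toString())
                .collect(Collectors.toList());
            return latestName(names).map(name -> segmentsDir.resolve(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : "crawl/segments");
        if (!Files.isDirectory(dir)) {
            System.out.println("no such directory: " + dir);
            return;
        }
        System.out.println(latest(dir).map(Path::toString).orElse("no segments found"));
    }
}
```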


[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

2012-02-19 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211314#comment-13211314
 ] 

Lewis John McGibbney commented on NUTCH-1281:
-

Hi behnam, there is a similar issue open and a patch has been submitted for 
Nutchgora. I wonder if you can check it out and comment on the link between 
these two. NUTCH-965

Also would it be possible for you to attach your code changes as a patch 
against trunk? Which I guess is what you are using. Thank you

 tika parser not work properly with unwanted file types that passed from 
 filters in nutch
 

 Key: NUTCH-1281
 URL: https://issues.apache.org/jira/browse/NUTCH-1281
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: behnam nikbakht

 when this property is set in parse-plugins.xml:
 <mimeType name="*">
   <plugin id="parse-tika" />
 </mimeType>
 all unwanted files that pass all filters are referred to tika,
 but for some file types like .flv the tika parser has a problem: it hangs and 
 causes the parse job to fail.
 if these file types pass regex-urlfilter and the other filters, the parse job 
 fails.
 for this problem I suggest adding some properties for valid file types, and 
 using code like this in TikaParser.java:
 public ParseResult getParse(Content content) {
   String mimeType = content.getContentType();
 + // MIME types that are allowed to reach the Tika parser
 + String[] validTypes = new String[] {
 +   "application/pdf", "application/x-tika-msoffice", "application/x-tika-ooxml",
 +   "application/vnd.oasis.opendocument.text", "text/plain", "application/rtf",
 +   "application/rss+xml", "application/x-bzip2", "application/x-gzip",
 +   "application/x-javascript", "application/javascript", "text/javascript",
 +   "application/x-shockwave-flash", "application/zip", "text/xml", "application/xml"};
 + boolean valid = false;
 + for (int k = 0; k < validTypes.length; k++) {
 +   if (validTypes[k].compareTo(mimeType.toLowerCase()) == 0)
 +     valid = true;
 + }
 + // skip parsing for anything that is not in the list
 + if (!valid)
 +   return new ParseStatus(ParseStatus.NOTPARSED, "Can't parse for unwanted filetype " + 
 +       mimeType).getEmptyParseResult(content.getUrl(), getConf());
   
   URL base;


[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

2012-02-19 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211320#comment-13211320
 ] 

Lewis John McGibbney commented on NUTCH-1278:
-

Behnam, this looks interesting but there are a few problems here.
1) It would be much much easier for us to apply, test and comment on your 
contribution if you included it in a simple .patch file. This can be done like 
so 
{code}
$ cd $NUTCH_HOME
$ svn diff > NUTCH-patch-name.patch
{code}
The current zip format for the patch(es), plus the fact that every class has 
been patched separately from their own respective directories, makes it really 
hard for us to work with this.
2) It doesn't appear that this patch actually applies against trunk? Maybe 
1.4? You can check out trunk here [1]. I was getting errors when trying to apply 
HttpBase, then gave up and started writing this.
3) For a change to the fetcher of this scale, it would be really nice if you 
could provide a test within the test suite we already maintain [2].

As I said this looks really great, and sorry for the rather lengthy initial 
response, but for us to consider this for integration it would be great for 
your contributions to meet this minimum requirement as they are highly 
appreciated. Thank you

[1] https://svn.apache.org/repos/asf/nutch/trunk/
[2] 
https://svn.apache.org/viewvc/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java?view=markup


 Fetch Improvement in threads per host
 -

 Key: NUTCH-1278
 URL: https://issues.apache.org/jira/browse/NUTCH-1278
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht
 Attachments: NUTCH-1278.zip


 the value of maxThreads is equal to fetcher.threads.per.host and is constant 
 for every host
 it would be possible to use a dynamic value for every host that is 
 influenced by the number of blocked requests.
 this means that if the number of blocked requests for one host increases, then 
 we must decrease this value and increase http.timeout
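
The proposal can be pictured with a tiny policy function: the more requests to a host are blocked, the fewer threads that host gets and the longer the timeout becomes. Everything in this sketch is an assumption made for illustration (the HostThrottle name, the step size, the 50% timeout increment); it is not the logic in the attached zip.

```java
/** Hypothetical policy sketch for per-host thread and timeout adjustment. */
final class HostThrottle {
    /** One thread fewer for every blockedPerStep blocked requests, never below 1. */
    static int maxThreads(int configuredThreads, int blockedRequests, int blockedPerStep) {
        return Math.max(1, configuredThreads - steps(blockedRequests, blockedPerStep));
    }

    /** Half of the base timeout is added for every blockedPerStep blocked requests. */
    static long timeoutMillis(long baseTimeoutMillis, int blockedRequests, int blockedPerStep) {
        return baseTimeoutMillis
            + steps(blockedRequests, blockedPerStep) * (baseTimeoutMillis / 2);
    }

    private static int steps(int blockedRequests, int blockedPerStep) {
        if (blockedPerStep <= 0) {
            throw new IllegalArgumentException("blockedPerStep must be positive");
        }
        return Math.max(0, blockedRequests) / blockedPerStep;
    }

    public static void main(String[] args) {
        // 4 configured threads and a 10 s timeout, with 25 blocked requests for the host
        System.out.println(maxThreads(4, 25, 10));
        System.out.println(timeoutMillis(10000L, 25, 10));
    }
}
```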


[jira] [Commented] (NUTCH-929) Create a REST-based admin UI for Nutch

2012-02-19 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211404#comment-13211404
 ] 

Lewis John McGibbney commented on NUTCH-929:


As we are using org.restlet as the underlying RESTlet framework, we will need 
to utilise the presentation technologies it supports, e.g. integration with 
three popular template technologies: XSLT, FreeMarker or Apache Velocity.

[1] 
http://wiki.restlet.org/docs_2.0/13-restlet/21-restlet/378-restlet/116-restlet.html

 Create a REST-based admin UI for Nutch
 --

 Key: NUTCH-929
 URL: https://issues.apache.org/jira/browse/NUTCH-929
 Project: Nutch
  Issue Type: New Feature
  Components: administration gui
Affects Versions: nutchgora
Reporter: Andrzej Bialecki 

 This is a follow up to NUTCH-880 - we need to expose the functionality of 
 REST API in a user-friendly admin UI. Thanks to the nature of the API the UI 
 can be implemented in any UI framework that speaks REST/JSON, so it could be 
 a simple webapp (we already have jetty) or a Swing / Pivot / etc standalone 
 application.


[jira] [Commented] (NUTCH-1273) Fix [deprecation] javac warnings

2012-02-19 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211464#comment-13211464
 ] 

Lewis John McGibbney commented on NUTCH-1273:
-

With this issue, do we wish to simply suppress the warnings? What other options 
do we have? It makes me think that we could upgrade the use of classes within 
our library dependencies. Any ideas?

 Fix [deprecation] javac warnings
 

 Key: NUTCH-1273
 URL: https://issues.apache.org/jira/browse/NUTCH-1273
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5


 As part of this task, these warnings should be resolved, however this 
 particular strand of warnings can either be resolved by adding
 {code}
 @SuppressWarnings("deprecation")
 {code}
 or by actually upgrading our class usage to rely upon non-deprecated classes. 
 Which option is more appropriate for the project?
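
To make the two options concrete, here is a JDK-only sketch that uses the deprecated java.util.Date#getYear() as a stand-in for whatever deprecated calls the build actually warns about. The class and method names are hypothetical.

```java
import java.util.Calendar;
import java.util.Date;

/** Hypothetical sketch contrasting the two ways of resolving the warning. */
final class DeprecationOptions {
    /** Option 1: keep the deprecated call and silence the compiler warning. */
    @SuppressWarnings("deprecation")
    static int yearSuppressed(Date date) {
        return date.getYear() + 1900; // Date.getYear() counts from 1900
    }

    /** Option 2: switch to the non-deprecated replacement. */
    static int yearUpgraded(Date date) {
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(date);
        return calendar.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        Date now = new Date();
        System.out.println(yearSuppressed(now) + " " + yearUpgraded(now));
    }
}
```

Option 1 is the smaller change but leaves the deprecated call in place; option 2 removes the warning at its source at the cost of touching more code.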


[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2012-02-19 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211470#comment-13211470
]

Lewis John McGibbney commented on NUTCH-978:

Hi Chris did you mentor this project through GSoC? I've downloaded the .zip
available in the description (which I've also attached in case the link goes
AWOL) and I'm going to play about with it. I'll attach it as a patch if I get
anywhere.

[GSoC 2011] A Plugin for extracting certain element of a web page on html
page parsing.
---

Original Estimate: 1,680h
Remaining Estimate: 1,680h

[jira] [Commented] (NUTCH-1079) StringBuffer converted to StringBuilder

2012-02-18 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210905#comment-13210905
 ] 

Lewis John McGibbney commented on NUTCH-1079:
-

I kinda got this feeling Julien. Thanks. Well, I think based on the discussion 
above, there seems to be no overwhelming reason for changing all of this. You 
did however begin to make a point of sorts Markus, any thoughts now that this 
one has had a bit of time to settle in?

 StringBuffer converted to StringBuilder
 ---

 Key: NUTCH-1079
 URL: https://issues.apache.org/jira/browse/NUTCH-1079
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, indexer
Reporter: Karthik K
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1079.patch, NUTCH-rel_14-1079.patch


 All across the codebase, StringBuffer is used where thread-safety is 
 probably not intended. 
 This patch replaces StringBuffer with StringBuilder, as applicable. 
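
The kind of change meant here is shown below in a hedged sketch: a buffer that never leaves the method cannot be shared between threads, so the unsynchronized StringBuilder is a drop-in replacement for StringBuffer. JoinExample is a made-up class, not code from the patch.

```java
/** Hypothetical sketch of a StringBuffer to StringBuilder conversion. */
final class JoinExample {
    /** Joins parts with a separator using a method-local StringBuilder. */
    static String join(String[] parts, String separator) {
        // Before the change this would have been: new StringBuffer()
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(separator);
            }
            sb.append(parts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(new String[] {"fetcher", "indexer"}, ", "));
    }
}
```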


[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0

2012-02-18 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210937#comment-13210937
 ] 

Lewis John McGibbney commented on NUTCH-1246:
-

Committed @ revision 1245921 in nutchgora, thanks Julien. There was one small 
change where the jackson dependency was related to org.restlet instead of 
org.codehaus; it is integral to some Nutchgora functionality so couldn't be 
removed. Also hadoop-test dependencies have been upgraded to 0.20.205.  


 Upgrade to Hadoop 1.0.0
 ---

 Key: NUTCH-1246
 URL: https://issues.apache.org/jira/browse/NUTCH-1246
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora, 1.5
Reporter: Julien Nioche




[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter

2012-02-17 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210188#comment-13210188
 ] 

Lewis John McGibbney commented on NUTCH-1210:
-

Hey Markus. In /conf we also have .template files for current filters of this 
nature. I don't know if you want to include one of those :0| 

 DomainBlacklistFilter
 -

 Key: NUTCH-1210
 URL: https://issues.apache.org/jira/browse/NUTCH-1210
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1210-1.5-1.patch


 The current DomainFilter acts as a white list. We also need a filter that 
 acts as a black list so we can allow TLDs and/or domains with DomainFilter 
 but blacklist specific subdomains. If we patched the current DomainFilter 
 for this behaviour, it would break current semantics such as its precedence. 
 Therefore I propose a new filter instead.
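
The blacklist behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the filter from NUTCH-1210-1.5-1.patch: the class name `BlacklistSketch` is hypothetical, and the convention of returning the URL to accept it and null to reject it is assumed here for simplicity.

```java
import java.net.URI;
import java.util.Set;

// Hypothetical sketch; not the filter from NUTCH-1210-1.5-1.patch.
class BlacklistSketch {
    private final Set<String> blacklist;

    BlacklistSketch(Set<String> blacklist) {
        this.blacklist = blacklist;
    }

    // Returns the URL unchanged if it passes, or null if the host
    // or one of its parent domains is on the blacklist.
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable URL: reject
        }
        if (host == null) {
            return null;
        }
        host = host.toLowerCase();
        // Check "a.b.example.com", then "b.example.com", "example.com", "com".
        while (true) {
            if (blacklist.contains(host)) {
                return null;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return url;
            }
            host = host.substring(dot + 1);
        }
    }
}
```

With a blacklist containing only ads.example.com, this sketch lets www.example.com through while rejecting ads.example.com and anything below it, which is the "allow the domain, block specific subdomains" case from the description.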


[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0

2012-02-17 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210312#comment-13210312
 ] 

Lewis John McGibbney commented on NUTCH-1246:
-

How is this issue?

 Upgrade to Hadoop 1.0.0
 ---

 Key: NUTCH-1246
 URL: https://issues.apache.org/jira/browse/NUTCH-1246
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora, 1.5
Reporter: Julien Nioche




[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-02-17 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210498#comment-13210498
 ] 

Lewis John McGibbney commented on NUTCH-585:


I like this contribution Elisabeth. Is there any way it could be updated to 
trunk with the following suggestions?
1) Please rename the package names to org.apache.nutch.blah.blah
2) In your ivy.xml please change the ivy-configurations.xml include to
{code}
  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>
{code}
This is Eclipse specific.
3) Would it be possible to change the CHANGES.txt to package.html and store it 
in the lowest folder within the java directory?
4) It would really put the cherry on top if we could get a test case, 
this would be a big +1.
5) I think the name is maybe a bit long... but I am fine keeping it if you 
think it is appropriate, as it is your patch after all.

Thank you for the contribution.

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, maybe factoring out the comment 
 strings as constants in the configuration files (say parser.html.ignore.start 
 and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet. We look forward to any 
 expression of interest - or to an explanation of why what we are doing is 
 plain wrong!
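
As a rough illustration of the idea in the description (not the code from any of the attached patches), the marked regions could be dropped from the HTML before it is parsed. The class `IgnoreMarkers` and the method `stripIgnored` are hypothetical names; the START-IGNORE and STOP-IGNORE markers are the ones from the description, hard-coded here rather than read from configuration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of Nutch or of the attached patches.
class IgnoreMarkers {
    // Matches a START-IGNORE comment, everything after it (non-greedy,
    // across line breaks), and the next STOP-IGNORE comment.
    private static final Pattern IGNORED = Pattern.compile(
        "<!--\\s*START-IGNORE\\s*-->.*?<!--\\s*STOP-IGNORE\\s*-->",
        Pattern.DOTALL);

    // Returns the HTML with every marked region removed.
    static String stripIgnored(String html) {
        return IGNORED.matcher(html).replaceAll("");
    }
}
```

A regular expression is used here only to keep the sketch short; a real plugin would more likely skip the nodes while walking the parsed DOM.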


[jira] [Commented] (NUTCH-1001) bin/nutch fetch/parse handle crawl/segments directory

2012-02-17 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210506#comment-13210506
 ] 

Lewis John McGibbney commented on NUTCH-1001:
-

Hi Gabriele, are you interested in incorporating the comments into this patch? 
It was unfortunate not to get in to 1.4, but we have no immediate plan for 1.5 
so it would be great to revive this issue.

 bin/nutch fetch/parse handle crawl/segments directory
 -

 Key: NUTCH-1001
 URL: https://issues.apache.org/jira/browse/NUTCH-1001
 Project: Nutch
  Issue Type: Improvement
Reporter: Gabriele Kahlout
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1001.patch


 I'm having issues porting scripts across different systems to support the 
 step of extracting the latest/only segments resulting from the generate phase.
 Variants include:
 $ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1]
 $ s1=`ls -d crawl/segments/2* | tail -1` #[2]
 $ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o 
 [a-zA-Z0-9/\-]* |tail -1`
 $ segment=`$HADOOP_HOME/bin/hdfs -ls crawl/segments | tail -1 | grep -o 
 [a-zA-Z0-9/\-]* |tail -1`
 And I'm not sure what windows users would have to do. Some users may also do 
 with:
 bin/nutch fetch with crawl/segments/2*
 But I don't see a need for the user to extract or worry about the latest/only 
 segment, nor to have it as a described step in every Nutch tutorial. Moreover, 
 only fetch and parse expect a segment, while other commands are fine with the 
 directory of segments.
 Therefore, I think it's beneficial if fetch and parse also handle directories 
 of segments. 
 [1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
 [2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching
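
What the shell variants above compute can be sketched in Java as below. This is a minimal sketch and not the code from NUTCH-1001.patch: the class `SegmentPicker` is a hypothetical name, and it assumes segment directory names are fixed-width timestamps (for example 20120217093000), so that the lexicographically largest name is the latest segment.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.Optional;

// Hypothetical helper; not the code from NUTCH-1001.patch.
class SegmentPicker {
    // Assumes fixed-width timestamp names, so the lexicographically
    // largest name is also the most recent segment.
    static Optional<String> latest(Collection<String> segmentNames) {
        return segmentNames.stream().max(Comparator.naturalOrder());
    }
}
```

The names would come from listing crawl/segments on the local file system or on HDFS; that listing is left out so the sketch has no Hadoop dependency.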


[jira] [Commented] (NUTCH-1079) StringBuffer converted to StringBuilder

2012-02-17 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210516#comment-13210516
 ] 

Lewis John McGibbney commented on NUTCH-1079:
-

How is this going, guys? It seems that there was a level of agreement wrt appends 
over concats, but the patch/issue never seemed to get updated and has now 
stagnated slightly. Any chance of reviving the patient?

 StringBuffer converted to StringBuilder
 ---

 Key: NUTCH-1079
 URL: https://issues.apache.org/jira/browse/NUTCH-1079
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, indexer
Reporter: Karthik K
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1079.patch, NUTCH-rel_14-1079.patch


 StringBuffer is used all across the codebase, even where thread-safety is 
 probably not intended. 
 This patch replaces StringBuffer with StringBuilder, as applicable. 
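
For readers unfamiliar with the change, here is a small example of the pattern in question: a method-local StringBuilder with append calls, in place of StringBuffer or repeated String concatenation. The class `JoinExample` is made up for illustration and is not taken from NUTCH-1079.patch.

```java
// Hypothetical example; not taken from NUTCH-1079.patch.
class JoinExample {
    // Joins the parts with a separator using one unsynchronized
    // StringBuilder that never leaves this method.
    static String join(String[] parts, char sep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(sep);
            }
            sb.append(parts[i]);
        }
        return sb.toString();
    }
}
```

Because the builder is confined to a single method call, the synchronization that StringBuffer provides is not needed here.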


[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter

2012-02-17 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210529#comment-13210529
 ] 

Lewis John McGibbney commented on NUTCH-1210:
-

One last thing, I think your patch requires
{code}
<ant dir="urlfilter-domainblacklist" target="deploy/test/clean"/>
{code}
in src/plugin/build.xml. Thanks

 DomainBlacklistFilter
 -

 Key: NUTCH-1210
 URL: https://issues.apache.org/jira/browse/NUTCH-1210
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1210-1.5-1.patch


 The current DomainFilter acts as a white list. We also need a filter that 
 acts as a black list so we can allow TLDs and/or domains with DomainFilter 
 but blacklist specific subdomains. If we patched the current DomainFilter 
 for this behaviour, it would break current semantics such as its precedence. 
 Therefore I propose a new filter instead.


[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0

2012-02-17 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210534#comment-13210534
 ] 

Lewis John McGibbney commented on NUTCH-1246:
-

Removal of jackson library in ivy/ivy.xml committed @ revision 1245749 in trunk

 Upgrade to Hadoop 1.0.0
 ---

 Key: NUTCH-1246
 URL: https://issues.apache.org/jira/browse/NUTCH-1246
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora, 1.5
Reporter: Julien Nioche




[jira] [Commented] (NUTCH-1193) Incorrect url transform to lowercase: parameter solr

2012-02-17 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210542#comment-13210542
 ] 

Lewis John McGibbney commented on NUTCH-1193:
-

Committed @ revision 1245753 in trunk.
Thank you Eduardo for reporting.


 Incorrect url transform to lowercase: parameter solr
 

 Key: NUTCH-1193
 URL: https://issues.apache.org/jira/browse/NUTCH-1193
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3
Reporter: Eduardo dos Santos Leggiero
Priority: Trivial
  Labels: crawling
 Fix For: 1.5





[jira] [Commented] (NUTCH-1279) Check if limit has been reached in GeneraterReducer must be the first check performance-wise.

2012-02-15 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208341#comment-13208341
 ] 

Lewis John McGibbney commented on NUTCH-1279:
-

Hi Ferdy, have you checked whether this is the case in trunk as well? I know 
the fetcher architecture is slightly different.  

 Check if limit has been reached in GeneraterReducer must be the first check 
 performance-wise.
 -

 Key: NUTCH-1279
 URL: https://issues.apache.org/jira/browse/NUTCH-1279
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ferdy Galema
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1279.txt


 The (count >= limit) check should be put up front in the reduce method of the 
 generator, because that way when the limit is reached the reduce method will 
 return faster.
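
The reordering can be illustrated with a small stand-in for a reducer. This is a sketch only and not the Nutchgora GeneratorReducer: `LimitedCollector` and its fields are hypothetical, and the exact comparison against the limit is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a reducer; not the Nutchgora GeneratorReducer.
class LimitedCollector {
    private final long limit;
    private long count = 0;
    private final List<String> emitted = new ArrayList<>();

    LimitedCollector(long limit) {
        this.limit = limit;
    }

    // Called once per key, like a reduce method.
    void reduce(String key, Iterator<String> values) {
        // Cheapest check first: once the limit is reached,
        // return before doing any other per-key work.
        if (count >= limit) {
            return;
        }
        while (values.hasNext() && count < limit) {
            emitted.add(key + "=" + values.next());
            count++;
        }
    }

    List<String> emitted() {
        return emitted;
    }
}
```

Any more expensive per-key work would go after the first check, so that keys arriving after the limit is reached cost almost nothing.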


[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

2012-02-15 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208413#comment-13208413
 ] 

Lewis John McGibbney commented on NUTCH-1278:
-

Hi Behnam. Do you have a patch for trunk? Thank you

 Fetch Improvement in threads per host
 -

 Key: NUTCH-1278
 URL: https://issues.apache.org/jira/browse/NUTCH-1278
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht

 The value of maxThreads is equal to fetcher.threads.per.host and is constant 
 for every host.
 It would be possible to use a dynamic value for every host, influenced by the 
 number of blocked requests.
 This means that if the number of blocked requests for one host increases, we 
 must decrease this value and increase http.timeout.
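
One possible shape for such a per-host policy is sketched below. Everything here is an assumption made for illustration: the class `HostThreadPolicy`, the rule of one thread fewer per blocked request, and the floor of one thread are not from Nutch's Fetcher. Only the thread count is modelled; the timeout adjustment is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical policy sketch; not Nutch's Fetcher code.
class HostThreadPolicy {
    // Stands in for the configured fetcher.threads.per.host value.
    private final int configuredMax;
    private final Map<String, Integer> blocked = new HashMap<>();

    HostThreadPolicy(int configuredMax) {
        this.configuredMax = configuredMax;
    }

    // Records one blocked request for the host.
    void recordBlocked(String host) {
        blocked.merge(host, 1, Integer::sum);
    }

    // Allows one thread fewer per blocked request, but never fewer than one.
    int maxThreads(String host) {
        int b = blocked.getOrDefault(host, 0);
        return Math.max(1, configuredMax - b);
    }
}
```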


[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter

2012-02-15 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208459#comment-13208459
 ] 

Lewis John McGibbney commented on NUTCH-1210:
-

This looks really nice Markus. I like the documentation and test as well. I 
would like to try it out with another couple of test scenarios before passing 
my full opinion, which I will be able to do this afternoon. 

 DomainBlacklistFilter
 -

 Key: NUTCH-1210
 URL: https://issues.apache.org/jira/browse/NUTCH-1210
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1210-1.5-1.patch


 The current DomainFilter acts as a white list. We also need a filter that 
 acts as a black list so we can allow TLDs and/or domains with DomainFilter 
 but blacklist specific subdomains. If we patched the current DomainFilter 
 for this behaviour, it would break current semantics such as its precedence. 
 Therefore I propose a new filter instead.


[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter

2012-02-15 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208881#comment-13208881
 ] 

Lewis John McGibbney commented on NUTCH-1210:
-

Hi Markus. 
1) I would ask one tiny change in ivy.xml
from
{code}
  <configurations>
    <include file="${nutch.root}/ivy/ivy-configurations.xml"/>
  </configurations>
{code}
to
{code}
  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>
{code}
this is purely for consistency as I think it's easier to configure in Eclipse 
as the ${nutch.root} variable hasn't been specified.

2) Also domainblacklist-urlfilter.txt is not included in the patch under /conf. 
Would it be possible to have a file there with some commented out documentation 
so users at least have something to go on?

3) Your documentation in the main class also mentions that the property can be 
overridden in nutch-*.xml. However, no property exists in nutch-default for 
people to go on, so it is likely people will become confused when 
trying to set the property from nutch-site.xml.

My tests seem to be failing with trunk, so there is something up with my 
trunk checkout. I'll go get that sorted and then test a bit more. Thanks  



 DomainBlacklistFilter
 -

 Key: NUTCH-1210
 URL: https://issues.apache.org/jira/browse/NUTCH-1210
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1210-1.5-1.patch


 The current DomainFilter acts as a white list. We also need a filter that 
 acts as a black list so we can allow TLDs and/or domains with DomainFilter 
 but blacklist specific subdomains. If we patched the current DomainFilter 
 for this behaviour, it would break current semantics such as its precedence. 
 Therefore I propose a new filter instead.


[jira] [Commented] (NUTCH-1222) Upgrade to new Hadoop 0.22.0

2012-02-14 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207730#comment-13207730
 ] 

Lewis John McGibbney commented on NUTCH-1222:
-

Is this necessary anymore Markus now that we are using 1.0.0?

 Upgrade to new Hadoop 0.22.0
 

 Key: NUTCH-1222
 URL: https://issues.apache.org/jira/browse/NUTCH-1222
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.5

 Attachments: NUTCH-1222-1.5-1.patch





[jira] [Commented] (NUTCH-1222) Upgrade to new Hadoop 0.22.0

2012-02-14 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207754#comment-13207754
 ] 

Lewis John McGibbney commented on NUTCH-1222:
-

Hey Markus it was more a question rather than anything else really. I 
personally don't have much of a preference as I'm not following the Hadoop 
project decisions as closely as others, therefore I don't know intricate 
differences in development tracks if I'm honest. Maybe you may wish to keep 
this open or something. Up to you I guess :0)

 Upgrade to new Hadoop 0.22.0
 

 Key: NUTCH-1222
 URL: https://issues.apache.org/jira/browse/NUTCH-1222
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.5

 Attachments: NUTCH-1222-1.5-1.patch





[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml

2012-02-13 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206933#comment-13206933
 ] 

Lewis John McGibbney commented on NUTCH-1205:
-

OK when I apply the patch, I'm seeing 

{code}
[ivy:resolve] :: problems summary ::
[ivy:resolve]  WARNINGS
[ivy:resolve]   [FAILED ] 
maven-plugins#maven-cobertura-plugin;1.3!maven-cobertura-plugin.plugin:  (0ms)
[ivy:resolve]    local: tried
[ivy:resolve] 
/home/lewis/.ivy2/local/maven-plugins/maven-cobertura-plugin/1.3/plugins/maven-cobertura-plugin.plugin
[ivy:resolve]    maven2: tried
[ivy:resolve] 
http://repo1.maven.org/maven2/maven-plugins/maven-cobertura-plugin/1.3/maven-cobertura-plugin-1.3.plugin
[ivy:resolve]    apache-snapshot: tried
[ivy:resolve] 
http://repository.apache.org/content/groups/snapshots-group/maven-plugins/maven-cobertura-plugin/1.3/maven-cobertura-plugin-1.3.plugin
[ivy:resolve]   [FAILED ] 
maven-plugins#maven-findbugs-plugin;1.3.1!maven-findbugs-plugin.plugin:  (0ms)
[ivy:resolve]    local: tried
[ivy:resolve] 
/home/lewis/.ivy2/local/maven-plugins/maven-findbugs-plugin/1.3.1/plugins/maven-findbugs-plugin.plugin
[ivy:resolve]    maven2: tried
[ivy:resolve] 
http://repo1.maven.org/maven2/maven-plugins/maven-findbugs-plugin/1.3.1/maven-findbugs-plugin-1.3.1.plugin
[ivy:resolve]    apache-snapshot: tried
[ivy:resolve] 
http://repository.apache.org/content/groups/snapshots-group/maven-plugins/maven-findbugs-plugin/1.3.1/maven-findbugs-plugin-1.3.1.plugin
[ivy:resolve]   ::
[ivy:resolve]   ::  FAILED DOWNLOADS::
[ivy:resolve]   :: ^ see resolution messages for details  ^ ::
[ivy:resolve]   ::
[ivy:resolve]   :: 
maven-plugins#maven-cobertura-plugin;1.3!maven-cobertura-plugin.plugin
[ivy:resolve]   :: 
maven-plugins#maven-findbugs-plugin;1.3.1!maven-findbugs-plugin.plugin
[ivy:resolve]   ::

{code}

There is a really weird extension for the plugins e.g. 
{code}
maven-cobertura-plugin.plugin
{code}

I've tried excluding these as both individual exclusions for the Gora artifacts 
and as global exclusions for maven-plugins but none of this works. I've been doing 
some reading on ivysettings on the ant/ivy website but there is a bit of 
documentation so it's taking a while.

 Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml
 ---

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, 
 NUTCH-1205-v4.patch, NUTCH-1205.patch


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml

2012-02-13 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206939#comment-13206939
 ] 

Lewis John McGibbney commented on NUTCH-1205:
-

To add to this, I can confirm that we are pulling the most up to date maven 
artifacts from the apache snapshots repository, so at least we are using 
bleeding edge here. 

 Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml
 ---

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, 
 NUTCH-1205-v4.patch, NUTCH-1205.patch


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml

2012-02-13 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206961#comment-13206961
 ] 

Lewis John McGibbney commented on NUTCH-1205:
-

Yeah it's another kettle of fish altogether. I'll get on it and hopefully get 
it sorted out. I'll ensure that the final patch includes the hsqldb upgrade as 
well Ferdy. Thanks for now.

 Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml
 ---

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, 
 NUTCH-1205-v4.patch, NUTCH-1205.patch


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2012-02-10 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205750#comment-13205750
]

Lewis John McGibbney commented on NUTCH-1129:
-

Hi Markus. I'm really gutted about this one, I've not had time to sort it out.
I want to say the following things though.
- Any23 is now available on repository.apache.org [1], however I think we need
to change our ivy resolver to fetch these 0.7.0-snapshots. Should be pretty
trivial though.
- Any23 already has a crawler plugin implementation (nothing like the stuff we
offer in Nutch ;0)) I'm not aware of the code, but it might be worth a swatch?
[2] Unfortunately the documentation is not great at all as I'm sure you'll
agree.

[1] https://repository.apache.org/index.html#nexus-search;quick~org.apache.any23
[2] https://svn.apache.org/viewvc/incubator/any23/trunk/plugins/basic-crawler/

Any23 Nutch plugin
--

Key: NUTCH-1129
URL: https://issues.apache.org/jira/browse/NUTCH-1129
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
Fix For: 1.5

This plugin should build on the Any23 library to provide us with a plugin
which extracts RDF data from HTTP and file resources. Although as of writing
Any23 is not part of the ASF, the project is working towards integration into
the Apache Incubator. Once the project proves its value, this would be an
excellent addition to the Nutch 1.X codebase.

[jira] [Commented] (NUTCH-1269) Generate main problems

2012-02-08 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203453#comment-13203453
 ] 

Lewis John McGibbney commented on NUTCH-1269:
-

Hi Behnam. Can you please package the above code as a patch against 1.5 
(trunk)? That way we can try it if we get time. Thank you

 Generate main problems
 --

 Key: NUTCH-1269
 URL: https://issues.apache.org/jira/browse/NUTCH-1269
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht
  Labels: Generate, MaxHostCount, MaxNumSegments

 there are some problems with current Generate method, with maxNumSegments and 
 maxHostCount options:
 1. first, size of generated segments are different
 2. with maxHostCount option, it is unclear that it was applied or not
 3. urls from one host are distributed non-uniform between segments
 we change Generator.java as described below:
 in Selector class:
 private int maxNumSegments;
 private int segmentSize;
 private int maxHostCount;
 public void config
 ...
   maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
   segmentSize = (int) job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
   maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
 ...
 public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
     OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
     throws IOException {
   int limit2 = (int) ((limit * 3) / 2);
   while (values.hasNext()) {
     if (count == limit)
       break;
     if (count % segmentSize == 0) {
       if (currentsegmentnum < maxNumSegments - 1) {
         currentsegmentnum++;
       } else
         currentsegmentnum = 0;
     }
     boolean full = true;
     for (int jk = 0; jk < maxNumSegments; jk++) {
       if (segCounts[jk] < segmentSize) {
         full = false;
       }
     }
     if (full) {
       break;
     }
     SelectorEntry entry = values.next();
     Text url = entry.url;
     //logWrite("Generated3:" + limit + "-" + count + "-" + url.toString());
     String urlString = url.toString();
     URL u = null;
     String hostordomain = null;
     try {
       if (normalise && normalizers != null) {
         urlString = normalizers.normalize(urlString,
             URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
       }

       u = new URL(urlString);
       if (byDomain) {
         hostordomain = URLUtil.getDomainName(u);
       } else {
         hostordomain = new URL(urlString).getHost();
       }

       hostordomain = hostordomain.toLowerCase();
       boolean countLimit = true;
       // only filter if we are counting hosts or domains
       int[] hostCount = hostCounts.get(hostordomain);
       // host count: {a,b,c,d} means that from this host there are a
       // urls in segment 0 and b urls in seg 1 and ...
       if (hostCount == null) {
         hostCount = new int[maxNumSegments];
         for (int kl = 0; kl < hostCount.length; kl++)
           hostCount[kl] = 0;
         hostCounts.put(hostordomain, hostCount);
       }
       int selectedSeg = currentsegmentnum;
       int minCount = hostCount[selectedSeg];
       for (int jk = 0; jk < maxNumSegments; jk++) {
         if (hostCount[jk] < minCount) {
           minCount = hostCount[jk];
           selectedSeg = jk;
         }
       }
       if (hostCount[selectedSeg] <= maxHostCount) {
         count++;
         entry.segnum = new IntWritable(selectedSeg);
         hostCount[selectedSeg]++;
         output.collect(key, entry);
       }
     } catch (Exception e) {
       LOG.warn("Malformed URL: '" + urlString + "', skipping ("
           + StringUtils.stringifyException(e) + ")");
       logWrite("Generate-malform:" + hostordomain + "-" + url.toString());
       //continue;
     }
   }
 }
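
The core of the proposal, putting each URL into the segment that currently holds the fewest URLs from the same host, can be isolated in a few lines. The sketch below is a simplified model and not the modified Generator.java: `SegmentBalancer` is a hypothetical class, it always starts the search at segment 0 instead of a rotating current segment, and it treats the per-host maximum as a strict cap.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the idea above; not the modified Generator.java.
class SegmentBalancer {
    private final int numSegments;
    private final int maxPerHostPerSegment;
    // For each host, the number of URLs already placed in each segment.
    private final Map<String, int[]> hostCounts = new HashMap<>();

    SegmentBalancer(int numSegments, int maxPerHostPerSegment) {
        this.numSegments = numSegments;
        this.maxPerHostPerSegment = maxPerHostPerSegment;
    }

    // Returns the segment chosen for a URL from this host, or -1 if every
    // segment already holds the maximum number of URLs for the host.
    int assign(String host) {
        int[] counts = hostCounts.computeIfAbsent(host, h -> new int[numSegments]);
        int selected = 0;
        for (int i = 1; i < numSegments; i++) {
            if (counts[i] < counts[selected]) {
                selected = i;
            }
        }
        if (counts[selected] >= maxPerHostPerSegment) {
            return -1;
        }
        counts[selected]++;
        return selected;
    }
}
```

With two segments and a cap of two, successive URLs from one host go to segments 0, 1, 0, 1, and a fifth URL is refused, which is the uniform spread the description is after.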
 


[jira] [Commented] (NUTCH-1270) some of Deflate encoded pages not fetched

2012-02-08 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203515#comment-13203515
 ] 

Lewis John McGibbney commented on NUTCH-1270:
-

Hi Behnam, again thanks for opening this ticket, but could you possibly patch 
this against trunk (1.5)? Thank you

 some of Deflate encoded pages not fetched
 -

 Key: NUTCH-1270
 URL: https://issues.apache.org/jira/browse/NUTCH-1270
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht
  Labels: fetch, processDeflateEncoded

 Some web pages are fetched but their content cannot be retrieved.
 After the following change, this error is fixed.
 We changed lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:
   public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException {
 if (LOGGER.isTraceEnabled()) { LOGGER.trace("inflating"); }
 byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
 +if(content==null)
 + content = DeflateUtils.inflateBestEffort(compressed, 20);
 if (content == null)
   throw new IOException("inflateBestEffort returned null");
 if (LOGGER.isTraceEnabled()) {
   LOGGER.trace("fetched " + compressed.length
  + " bytes of compressed content (expanded to "
  + content.length + " bytes) from " + url);
 }
 return content;
   }


[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-07 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202603#comment-13202603
 ] 

Lewis John McGibbney commented on NUTCH-1259:
-

Hey Markus. I'm literally up to my eye balls with stuff the now so sorry for 
not having the time to look through your work. The best I can do is have a look 
tomorrow, I'll give it my all then. Thanks

 TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
 --

 Key: NUTCH-1259
 URL: https://issues.apache.org/jira/browse/NUTCH-1259
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1259-1.5-1.patch


 The MIME-type detected by Tika's Detect() API is never added to a Parse's 
 ContentMetaData or ParseMetaData. Because of this bad Content-Types will end 
 up in the documents. 


[jira] [Commented] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field

2012-02-05 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200716#comment-13200716
 ] 

Lewis John McGibbney commented on NUTCH-1140:
-

Hi Joe. This one seems to have slipped under the radar somewhat!
Can you please attach a patch against 1.5 (trunk)?
Thank you if this is possible.

 index-more plugin, resetTitle method creates multiple values in the Title 
 field
 ---

 Key: NUTCH-1140
 URL: https://issues.apache.org/jira/browse/NUTCH-1140
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3
Reporter: Joe Liedtke
Priority: Minor
 Fix For: 1.5

 Attachments: MoreIndexingFilter.093011.patch


 From the comments in MoreIndexingFilter.java, the index-more plugin is meant 
 to reset the Title field of a document if it contains a Content-Disposition 
 header. The current behavior is to add a Title regardless of whether one 
 exists or not, which can cause issues down the line with the Solr Indexing 
 process, and based on a thread in the nutch user list it appears that this is 
 causing some users to mark the title as multi-valued in the schema:
   
 http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
 The following patch removes the title field before adding a new one, which 
 has resolved the issue for me:
 --- MoreIndexingFilter.old2011-09-30 11:44:35.0 +
 +++ MoreIndexingFilter.java   2011-09-30 09:58:48.0 +
 @@ -276,6 +276,7 @@
      for (int i=0; i<patterns.length; i++) {
        if (matcher.contains(contentDisposition,patterns[i])) {
          result = matcher.getMatch();
 +        doc.removeField("title");
          doc.add("title", result.group(1));
          break;
        }
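
 The effect of removing the field before adding it can be shown with a small stand-in for a multi-valued document. This is only a sketch: `SimpleDoc` and its `reset` method are hypothetical and are not the Nutch document class, although `add` and `removeField` are named after the calls in the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for a multi-valued index document; not the Nutch class.
class SimpleDoc {
    private final Map<String, List<String>> fields = new HashMap<>();

    // Appends a value, so repeated calls make the field multi-valued.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Drops every value stored under the given field name.
    void removeField(String name) {
        fields.remove(name);
    }

    // Returns a copy of the values stored for the field (empty if none).
    List<String> values(String name) {
        return new ArrayList<>(fields.getOrDefault(name, new ArrayList<>()));
    }

    // Replaces instead of appending: the field ends up with exactly one value.
    void reset(String name, String value) {
        removeField(name);
        add(name, value);
    }
}
```

 Calling `add` twice for `title` leaves two values, which is what a non-multi-valued Solr field rejects; calling `reset` leaves one.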


[jira] [Commented] (NUTCH-1256) WebGraph to dump host + score

2012-01-31 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196894#comment-13196894
 ] 

Lewis John McGibbney commented on NUTCH-1256:
-

I like this Markus. I wonder if it is possible for you to add some in-line 
documentation? Or Javadoc, depends on what you wish. Also if you get time, is 
it possible for this to be added here
http://wiki.apache.org/nutch/bin/nutch%20nodedumper



 WebGraph to dump host + score
 -

 Key: NUTCH-1256
 URL: https://issues.apache.org/jira/browse/NUTCH-1256
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1256-1.5-1.patch


 WebGraph's NodeDumper tool can dump url,score information but a 
 host|domain,score output can also be put to good use. This is likely to 
 require a new MapReduce job as the NodeDumper's anatomy is not suited to 
 return max or summed scores. Code could also be merged with the tool.


[jira] [Commented] (NUTCH-1081) ant tests fail

2012-01-31 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197288#comment-13197288
 ] 

Lewis John McGibbney commented on NUTCH-1081:
-

Hi Ferdy. Have you noticed anything dodgy with this?

 ant tests fail 
 ---

 Key: NUTCH-1081
 URL: https://issues.apache.org/jira/browse/NUTCH-1081
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, generator, injector, storage
Affects Versions: nutchgora
 Environment: Ubuntu release 11.04 (natty)
 Kernel Linux 2.6.38-10-generic
 GNOME 2.32.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: nutchgora


 The following tests fail when running ant test on trunk 2.0
 {code}
 [junit] Running org.apache.nutch.api.TestAPI
 [junit] Tests run: 4, Failures: 1, Errors: 0, Time elapsed: 11.028 sec
 [junit] Test org.apache.nutch.api.TestAPI FAILED
 [junit] Running org.apache.nutch.crawl.TestGenerator
 [junit] Tests run: 4, Failures: 0, Errors: 4, Time elapsed: 0.478 sec
 [junit] Test org.apache.nutch.crawl.TestGenerator FAILED
 [junit] Running org.apache.nutch.crawl.TestInjector
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.474 sec
 [junit] Test org.apache.nutch.crawl.TestInjector FAILED
 [junit] Running org.apache.nutch.fetcher.TestFetcher
 [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.526 sec
 [junit] Test org.apache.nutch.fetcher.TestFetcher FAILED
 [junit] Running org.apache.nutch.storage.TestGoraStorage
 [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.468 sec
 [junit] Test org.apache.nutch.storage.TestGoraStorage FAILED
 {code}


[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

2012-01-21 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190545#comment-13190545
 ] 

Lewis John McGibbney commented on NUTCH-965:


Hi can anyone advise if I should be looking @ ParseUtil class in trunk? I'm a 
bit confused and Eclipse doesn't seem to be helping out much.

 Skip parsing for truncated documents
 

 Key: NUTCH-965
 URL: https://issues.apache.org/jira/browse/NUTCH-965
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Alexis
Assignee: Lewis John McGibbney
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-965-v2.patch, parserJob.patch


 The issue you're likely to run into when parsing truncated FLV files is 
 described here:
 http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
 The parser library gets stuck in infinite loop as it encounters corrupted 
 data due to for example truncating big binary files at fetch time.


[jira] [Commented] (NUTCH-1254) NTLMv2 is not supported and HttpClient returns error code 500

2012-01-18 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188491#comment-13188491
 ] 

Lewis John McGibbney commented on NUTCH-1254:
-

Hi Remi, are you able to provide a patch for trunk which either detects 
dynamically whether to use NTLMv2 or sets this as a boolean property in 
nutch-default.xml? 

 NTLMv2 is not supported and HttpClient returns error code 500
 -

 Key: NUTCH-1254
 URL: https://issues.apache.org/jira/browse/NUTCH-1254
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Remi Tassing
Priority: Minor

 When trying to access some SharePoint(IIS) website using NTLMv2 
 authentication, Nutch fails and gets an error code 500. HttpClient only 
 supports an early version of NTLM but not NTLMv2. HttpUrlConnection can be 
 used instead.
 [1]http://oaklandsoftware.com/papers/ntlm.html
 [2]http://developer-resource.blogspot.com/2008/06/ntlm-authentication-from-java.html


[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient

2012-01-18 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188505#comment-13188505
 ] 

Lewis John McGibbney commented on NUTCH-1086:
-

When trying to access some SharePoint(IIS) website using NTLMv2 authentication, 
Nutch fails and gets an error code 500. HttpClient only supports an early 
version of NTLM but not NTLMv2. HttpUrlConnection can be used instead.

[1]http://oaklandsoftware.com/papers/ntlm.html
[2]http://developer-resource.blogspot.com/2008/06/ntlm-authentication-from-java.html


 Rewrite protocol-httpclient
 ---

 Key: NUTCH-1086
 URL: https://issues.apache.org/jira/browse/NUTCH-1086
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Markus Jelsma

 There are several issues about protocol-httpclient and several comments about 
 rewriting the plugin with the new http client libraries. There is, however, 
 not yet an issue for rewriting/reimplementing protocol-httpclient.
 http://hc.apache.org/httpcomponents-client-ga/


[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0

2012-01-13 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185562#comment-13185562
 ] 

Lewis John McGibbney commented on NUTCH-1246:
-

Subtask?

 Upgrade to Hadoop 1.0.0
 ---

 Key: NUTCH-1246
 URL: https://issues.apache.org/jira/browse/NUTCH-1246
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora, 1.5
Reporter: Julien Nioche




[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-13 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185844#comment-13185844
 ] 

Lewis John McGibbney commented on NUTCH-1247:
-

Where in CrawlDatum is the CrawlDBReader map method on line 159 getting the 
RetriesSinceFetch() from? 
{code}
output.collect(new Text("retry " + value.getRetriesSinceFetch()), COUNT_1);
{code}

Also, excuse my naivety but can you be more verbose about why the byte value 
for CrawlDatum.retries goes bad?

 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5


 CrawlDatum.retries is a byte and goes bad with larger values.
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1
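
 The negative values in the log are what a counter stored in a signed Java byte produces once it passes 127. A minimal illustration follows; the class `RetryOverflowDemo` and its methods are hypothetical and are not taken from CrawlDatum.

```java
// Illustration only: shows how a counter stored in a Java byte wraps around.
class RetryOverflowDemo {
    // Increments a retry counter kept in a byte, as a byte-typed field would.
    static byte incrementAsByte(byte retries) {
        return (byte) (retries + 1);
    }

    // Reads the same byte as an unsigned value in the range 0..255.
    static int asUnsigned(byte retries) {
        return retries & 0xff;
    }

    public static void main(String[] args) {
        byte retries = Byte.MAX_VALUE;            // 127
        retries = incrementAsByte(retries);       // wraps around
        System.out.println(retries);              // prints -128
        System.out.println(asUnsigned(retries));  // prints 128
    }
}
```

 Read this way, "retry -128" corresponds to 128 retries and "retry -127" to 129. Masking only moves the limit to 255, so widening the field to int, as the issue title proposes, is the more robust choice.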


[jira] [Commented] (NUTCH-809) Parse-metatags plugin

2012-01-12 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184880#comment-13184880
 ] 

Lewis John McGibbney commented on NUTCH-809:


Hi Elisabeth although I haven't had time to look through your zip yet a big 
thank you must be aimed your way. If you have time and are willing please 
create a new page on the Nutch wiki under plugin central. As you can see this 
issue is closely linked to some others of similar nature so it may/may not 
change in the future, however community driven documentation is exactly what we 
are after and it is greatly welcomed.

Please contact me off list or @ dev@ with your wiki username and I will add you 
to the wiki contributors page.

Thank you 

[1] http://wiki.apache.org/nutch/PluginCentral 

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4, nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.5

 Attachments: NUTCH-809.patch, NUTCH-809_metatags_1.3.patch, 
 metatags-plugin+tutorial.zip


 h2. Parse-metatags plugin
 The parse-metatags plugin consists of a HTMLParserFilter which takes as 
 parameter a list of metatag names with '*' as default value. The values are 
 separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields 
 'keywords' and 'description'. Note that keywords is multivalued.
 The query-basic plugin is used to include these fields in the search e.g. in 
 nutch-site.xml
 {code:xml}
 <property>
   <name>query.basic.description.boost</name>
   <value>2.0</value>
 </property>
 <property>
   <name>query.basic.keywords.boost</name>
   <value>2.0</value>
 </property>
 {code}
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com


[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2012-01-12 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185266#comment-13185266
 ] 

Lewis John McGibbney commented on NUTCH-797:


This has been committed but the issue is still open and marked as unresolved. 
I've just spent around 30 mins looking through the three open issues closely 
surrounding this problem area with constructing outlinks beginning with ?'s. I 
think that we need to have a close look to try and sort the three issues out.  

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link, and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1"
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect.  Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +   // handle params that are embedded into the base url - move them to target
 +   // so URL class constructs the new url class properly
 +   if  (base.toString().indexOf(';') > 0)
 +  return fixEmbeddedParams(base, target);
 +   
 +   // handle the case that there is a target that is a pure query.
 +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
 +   // URLs but I've seen this in numerous places, for example at
 +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
 +   // URL constructs the base+target combo as 
 +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
 +   // dropping the Search.aspx target
 +   //
 +   // Browsers handle these just fine, they must have an exception similar to this
 +   if (target.startsWith("?"))
 +   {
 +   return fixPureQueryTargets(base, target);
 +   }
 +   
 +   return new URL(base, target);
 +  }
 +  
 +  private URL fixPureQueryTargets(URL base, String target) throws 
 MalformedURLException
 +
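
 The patch text is cut off above, so the body of fixPureQueryTargets is not shown. The workaround described in the issue (prefix a pure-query target with the last path segment of the base, then resolve as usual) could look roughly like the sketch below. It is an assumption-laden illustration, not the attached patch: the class `PureQueryTargetFix` is hypothetical, it uses `java.net.URI` rather than the `URL` class used in the patch, and the `&` separators in the example URLs are assumed.

```java
import java.net.URI;

// Hypothetical sketch of the workaround described in the issue;
// not the code from the attached patch.
class PureQueryTargetFix {
    // If the target is a pure query ("?..."), prefix it with the last path
    // segment of the base so that resolution keeps that segment.
    static URI resolve(String base, String target) {
        URI baseUri = URI.create(base);
        if (target.startsWith("?")) {
            String path = baseUri.getPath();   // e.g. /Careers/ASPX/Search.aspx
            if (path != null) {
                String lastSegment = path.substring(path.lastIndexOf('/') + 1);
                target = lastSegment + target; // e.g. Search.aspx?co=0&sk=0
            }
        }
        return baseUri.resolve(target);
    }
}
```

 For the example in the issue, the target becomes Search.aspx?co=0&sk=0&p=2&pi=1 before resolution, so the resolved URL keeps the Search.aspx segment. Targets that do not start with ? are resolved unchanged.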

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2012-01-12 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185285#comment-13185285
 ] 

Lewis John McGibbney commented on NUTCH-1031:
-

Hi Julien, out of sheer curiosity, how do we currently parse robots.txt? I 
found some files (which don't do parsing) in o.a.n.protocol but I've never 
known what we use for robots.txt

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.5


 We're about to release the first version of Crawler-Commons 
 [http://code.google.com/p/crawler-commons/] which contains a parser for 
 robots.txt files. This parser should also be better than the one we currently 
 have in Nutch. I will delegate this functionality to CC as soon as it is 
 available publicly


[jira] [Commented] (NUTCH-1237) Improve javac arguements for more verbose output

2012-01-05 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13180357#comment-13180357
 ] 

Lewis John McGibbney commented on NUTCH-1237:
-

Any problems with committing this one? All local tests pass as per Julien's 
commits to fix the nightly build. It's providing us with a wealth of info of 
where the code can be improved.

 Improve javac arguements for more verbose output 
 -

 Key: NUTCH-1237
 URL: https://issues.apache.org/jira/browse/NUTCH-1237
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-1237-nutchgora.patch, NUTCH-1237-trunk.patch, 
 NUTCH-1237-trunk.patch


 When trying to fix another problem I stumbled across this one. I think it is 
 important to ensure that the javac outputs info regarding deprecation and 
 unchecked operations.


[jira] [Commented] (NUTCH-926) Nutch follows wrong url in META http-equiv=refresh tag

2012-01-05 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13180556#comment-13180556
]

Lewis John McGibbney commented on NUTCH-926:

Hey guys, just looking at our critical issues and hadn't noticed this one
previously, did anyone have a look at this issue and can we reproduce?

Nutch follows wrong url in META http-equiv=refresh tag
-

Key: NUTCH-926
URL: https://issues.apache.org/jira/browse/NUTCH-926
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.2
Environment: gnu/linux centOs
Reporter: Marco Novo
Priority: Critical
Attachments: ParseOutputFormat.java.patch

We have Nutch set to crawl a list of domain URLs, and we want to fetch only the
listed domains (hosts), not their subdomains.
So
WWW.DOMAIN1.COM
..
..
..
WWW.RIGHTDOMAIN.COM
..
..
..
..
WWW.DOMAIN.COM
We set Nutch to:
NOT FOLLOW EXTERNAL LINKS
During crawling of WWW.RIGHTDOMAIN.COM
if a page contains
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
<META http-equiv="refresh" content="0;
url=http://WRONG.RIGHTDOMAIN.COM">
</head>
<body>
</body>
</html>
Nutch continues to crawl the WRONG subdomains! But it should not do this!!
During crawling of WWW.RIGHTDOMAIN.COM
if a page contains
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
<META http-equiv="refresh" content="0;
url=http://WWW.WRONGDOMAIN.COM">
</head>
<body>
</body>
</html>
Nutch continues to crawl the WRONG domain! But it should not do this! Otherwise
we will end up spidering the whole web.
We think the problem is in org.apache.nutch.parse ParseOutputFormat. We have
written a patch and will attach it.
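
The check the reporter expects (follow a meta refresh only when its target stays on the same host) can be sketched as follows. This is a hypothetical illustration and not the attached ParseOutputFormat patch: the class `RefreshTargetCheck` and both method names are made up for the example, and host comparison is assumed to be case-insensitive.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch; not the attached ParseOutputFormat patch.
class RefreshTargetCheck {
    // Extracts the URL from a meta refresh content value such as
    // "0; url=http://example.com/". Returns null when no url= part exists.
    static String refreshUrl(String content) {
        int idx = content.toLowerCase(Locale.ROOT).indexOf("url=");
        return idx < 0 ? null : content.substring(idx + 4).trim();
    }

    // True only when both URLs have a host and the hosts match, ignoring case.
    static boolean sameHost(String fromUrl, String toUrl) {
        String from = URI.create(fromUrl).getHost();
        String to = URI.create(toUrl).getHost();
        return from != null && from.equalsIgnoreCase(to);
    }
}
```

Under this check, a page on WWW.RIGHTDOMAIN.COM that refreshes to WRONG.RIGHTDOMAIN.COM or WWW.WRONGDOMAIN.COM would not be followed, because the hosts differ.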

[jira] [Commented] (NUTCH-874) Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora

2012-01-05 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13180562#comment-13180562
 ] 

Lewis John McGibbney commented on NUTCH-874:


I know the heat has kind of shifted away from Nutchgora but it would be great 
to clarify what this issue actually encapsulates. Is it the case that 
some plugins in Nutchgora are not actually working with the Nutchgora API? I'm 
kind of confused by this one! 

 Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora
 --

 Key: NUTCH-874
 URL: https://issues.apache.org/jira/browse/NUTCH-874
 Project: Nutch
  Issue Type: Bug
  Components: parser
 Environment: Nutch 2.0
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Critical
 Fix For: nutchgora


 I just noticed while fixing NUTCH-564 that the ExtParser hasn't been brought 
 up to date with Nutch 2.0 trunk. We should review the plugins in src/plugin 
 to make sure they all work with Gora/Nutchbase now.


[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2012-01-03 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13178719#comment-13178719
 ] 

Lewis John McGibbney commented on NUTCH-1138:
-

Hey Markus, it's been committed in trunk but I was wanting to get on with the 
nutchgora patch asap. Leave it with me and I'll commit and close shortly. Thank 
you

 remove LogUtil from trunk and nutch gora
 

 Key: NUTCH-1138
 URL: https://issues.apache.org/jira/browse/NUTCH-1138
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5

 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch


 This should move towards the removal of the LogUtil class from both codebases 
 as per comments in NUTCH-1078.


[jira] [Commented] (NUTCH-1237) Improve javac arguements for more verbose output

2011-12-28 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176605#comment-13176605
 ] 

Lewis John McGibbney commented on NUTCH-1237:
-

Revised patch for trunk, I forgot something in the first one. Thanks

 Improve javac arguements for more verbose output 
 -

 Key: NUTCH-1237
 URL: https://issues.apache.org/jira/browse/NUTCH-1237
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-1237-nutchgora.patch, NUTCH-1237-trunk.patch, 
 NUTCH-1237-trunk.patch


 When trying to fix another problem I stumbled across this one. I think it is 
 important to ensure that the javac outputs info regarding deprecation and 
 unchecked operations.


[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2011-12-27 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176192#comment-13176192
 ] 

Lewis John McGibbney commented on NUTCH-1138:
-

Partly committed @ revision 1224917 in trunk
Fully committed @ revision 1224919 in trunk (second commit removed LogUtil 
altogether).

 remove LogUtil from trunk and nutch gora
 

 Key: NUTCH-1138
 URL: https://issues.apache.org/jira/browse/NUTCH-1138
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5

 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch


 This should move towards the removal of the LogUtil class from both codebases 
 as per comments in NUTCH-1078.


[jira] [Commented] (NUTCH-1237) Improve javac arguements for more verbose output

2011-12-27 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176220#comment-13176220
 ] 

Lewis John McGibbney commented on NUTCH-1237:
-

If I can get a +1 I'll commit. Thank you

 Improve javac arguements for more verbose output 
 -

 Key: NUTCH-1237
 URL: https://issues.apache.org/jira/browse/NUTCH-1237
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-1237-nutchgora.patch, NUTCH-1237-trunk.patch


 When trying to fix another problem I stumbled across this one. I think it is 
 important to ensure that the javac outputs info regarding deprecation and 
 unchecked operations.
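
 To illustrate what the extra javac output would report, here is a minimal 
 sketch. It is not taken from the attached patches; the class name LintDemo is 
 hypothetical. It assumes the build passes the standard javac options 
 -Xlint:unchecked and -Xlint:deprecation. With -Xlint:unchecked, javac 
 pinpoints the unchecked call in rawNames(); without it, javac only prints a 
 general note that the file uses unchecked or unsafe operations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Nutch.
public class LintDemo {

    // Raw types: with -Xlint:unchecked, javac reports the unchecked add() call here.
    public static List rawNames() {
        List names = new ArrayList();
        names.add("nutch");
        return names;
    }

    // Parameterized types: same run-time behaviour, but no unchecked warning.
    public static List<String> typedNames() {
        List<String> names = new ArrayList<String>();
        names.add("nutch");
        return names;
    }

    public static void main(String[] args) {
        System.out.println(rawNames().size() + " " + typedNames().get(0)); // prints "1 nutch"
    }
}
```

 Both methods behave identically at run time; the only difference is what 
 the compiler is able to check and report.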


[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2-incubating in ivy/ivy.xml

2011-12-27 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176224#comment-13176224
 ] 

Lewis John McGibbney commented on NUTCH-1205:
-

for reference to fix the above problems
http://stackoverflow.com/questions/197986/what-causes-javac-to-issue-the-uses-unchecked-or-unsafe-operations-warning

 Upgrade gora modules to 0.2-incubating in ivy/ivy.xml
 -

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1205-v2.patch, NUTCH-1205.patch


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Commented] (NUTCH-1217) Update NOTICE.txt to drop some copyrights

2011-12-26 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175949#comment-13175949
]

Lewis John McGibbney commented on NUTCH-1217:
-

Hi Guys, as I've looked deeper into this, the first patch is a load of drivel.
As we are pulling the overwhelming majority of our dependencies from upstream
repositories using Ivy, there is no need to include them in the NOTICE.txt
declarations. We only ship with the JavaSWF and Automaton libraries (both of
which are plugins). I'll commit this, do the same for Nutchgora, then close
this one off.

One last question, is anyone aware if our licences for the above two packages
are OK? I am not aware but I am more than happy to have a word with the authors
to find out. Thanks

Update NOTICE.txt to drop some copyrights
-

Key: NUTCH-1217
URL: https://issues.apache.org/jira/browse/NUTCH-1217
Project: Nutch
Issue Type: Improvement
Components: documentation
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: nutchgora, 1.5

Attachments: NUTCH-1217-trunk.patch

We have many references to software copyrights which should be dropped. Most
of these relate to the Lucene legacy days.
-Carrot2
-saxpath
-jaxen
-jdom
-snowball
-violinstrings
-Jena
-bouncycastle
-fontbox
-jempbox
-pdfbox
-rome
Also some need to be added
-slf4j
-activation
-mortbay (jetty)
-jline
-junit
-stax
-wstx
As I am unfamiliar with most of these, and as it is important to include all
references to software outside of the ASF, I would appreciate it if this list
could act as a starting point for completing this issue.

[jira] [Commented] (NUTCH-1081) ant tests fail

2011-12-26 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175956#comment-13175956
 ] 

Lewis John McGibbney commented on NUTCH-1081:
-

Hi Ferdy. There have been almost no problems within the CI testing environment 
for a number of weeks/months. Any failures seem to have been down to the 
project building on Ubuntu slaves as opposed to Solaris slaves; the failures 
are a result of incorrect envars being specified. I've added some more 
functionality to the nutchgora build configuration, e.g. publishing the JUnit 
test result report and publishing Javadoc. So as agreed we will keep an eye on 
this.

 ant tests fail 
 ---

 Key: NUTCH-1081
 URL: https://issues.apache.org/jira/browse/NUTCH-1081
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, generator, injector, storage
Affects Versions: nutchgora
 Environment: Ubuntu release 11.04 (natty)
 Kernel Linux 2.6.38-10-generic
 GNOME 2.32.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: nutchgora


 The following tests fail when running ant test on trunk 2.0
 {code}
 [junit] Running org.apache.nutch.api.TestAPI
 [junit] Tests run: 4, Failures: 1, Errors: 0, Time elapsed: 11.028 sec
 [junit] Test org.apache.nutch.api.TestAPI FAILED
 [junit] Running org.apache.nutch.crawl.TestGenerator
 [junit] Tests run: 4, Failures: 0, Errors: 4, Time elapsed: 0.478 sec
 [junit] Test org.apache.nutch.crawl.TestGenerator FAILED
 [junit] Running org.apache.nutch.crawl.TestInjector
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.474 sec
 [junit] Test org.apache.nutch.crawl.TestInjector FAILED
 [junit] Running org.apache.nutch.fetcher.TestFetcher
 [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.526 sec
 [junit] Test org.apache.nutch.fetcher.TestFetcher FAILED
 [junit] Running org.apache.nutch.storage.TestGoraStorage
 [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.468 sec
 [junit] Test org.apache.nutch.storage.TestGoraStorage FAILED
 {code}


[jira] [Commented] (NUTCH-1218) Improve trunk API documentation

2011-12-26 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175966#comment-13175966
 ] 

Lewis John McGibbney commented on NUTCH-1218:
-

Does anyone have any objections to me hacking away at this, making commits when 
I can? The intention is to work my way through the core classes, providing a 
description of each package, then get into more detail with individual classes 
within the 'core' bunch of classes. 

After this I'll move on to the plugins. 

After that I'll move on to Nutchgora!!!  

 Improve trunk API documentation
 ---

 Key: NUTCH-1218
 URL: https://issues.apache.org/jira/browse/NUTCH-1218
 Project: Nutch
  Issue Type: Sub-task
  Components: documentation
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.5

 Attachments: NUTCH-1218.patch


 The trunk API Java documentation could do with some improving. This issue 
 should track that. It should however not seek to change any functionality 
 within the codebase, only to substantiate and improve the existing 
 documentation.  


[jira] [Commented] (NUTCH-1218) Improve trunk API documentation

2011-12-26 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175967#comment-13175967
 ] 

Lewis John McGibbney commented on NUTCH-1218:
-

Another thing, even if I make the changes to trunk, it would be great to view 
them dynamically on the trunk Javadoc site [1] e.g. publish them after every 
commit to see the actual changes at incremental stages. Any advice on this? 
From looking at build.xml, it appears that we only fully publish the 
Javadoc when releasing... Is this the case? If not then can someone please 
advise me otherwise? Thanks guys  

 Improve trunk API documentation
 ---

 Key: NUTCH-1218
 URL: https://issues.apache.org/jira/browse/NUTCH-1218
 Project: Nutch
  Issue Type: Sub-task
  Components: documentation
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.5

 Attachments: NUTCH-1218.patch


 The trunk API Java documentation could do with some improving. This issue 
 should track that. It should however not seek to change any functionality 
 within the codebase, only to substantiate and improve the existing 
 documentation.  


[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2011-12-26 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176046#comment-13176046
 ] 

Lewis John McGibbney commented on NUTCH-1138:
-

Looking for logging irregularities in hadoop.log after running a medium sized 
crawl over mini MR cluster I'm struggling to see any adverse behaviour produced 
as a result of applying this patch. Most WARN's can be attributed to new MR API 
and I've a couple of java.net.SocketException: Connection reset ERRORS, which 
we must expect from time to time :0)

 remove LogUtil from trunk and nutch gora
 

 Key: NUTCH-1138
 URL: https://issues.apache.org/jira/browse/NUTCH-1138
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5

 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch


 This should move towards the removal of the LogUtil class from both codebases 
 as per comments in NUTCH-1078.


[jira] [Commented] (NUTCH-1216) Add trivial comment to lib/native/README.txt

2011-12-06 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163775#comment-13163775
 ] 

Lewis John McGibbney commented on NUTCH-1216:
-

If you guys are happy to add this then please say.

 Add trivial comment to lib/native/README.txt
 

 Key: NUTCH-1216
 URL: https://issues.apache.org/jira/browse/NUTCH-1216
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Trivial
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-1216-nutchgora.patch, NUTCH-1216-trunk.patch


 This trivial issue simply adds missing comments to the above file. The WARN 
 logging which is churned out has caused a small degree of confusion in the 
 past, therefore this sorts that out :0)


[jira] [Commented] (NUTCH-1200) Resolving Ivy dependencies in several plugins

2011-12-01 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13160847#comment-13160847
 ] 

Lewis John McGibbney commented on NUTCH-1200:
-

Hi Blaise I would direct you to this tutorial [1]. It covers everything you 
should need to get Nutch working within your Eclipse IDE. It takes about a half 
hour or so to set up but definitely works as I have been debugging some simple 
jobs from within Eclipse. Please get back to us on the user lists if you are 
having any problems. Thank you

[1] http://wiki.apache.org/nutch/RunNutchInEclipse

 Resolving Ivy dependencies in several plugins 
 --

 Key: NUTCH-1200
 URL: https://issues.apache.org/jira/browse/NUTCH-1200
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.5

 Attachments: NUTCH-1200-trunk.patch, NUTCH-1200-v2-trunk.patch


 When configuring Nutch 1.5-SNAPSHOT in Eclipse, I noticed that any plugins 
 requiring additional libraries OVER AND ABOVE the ones specified in 
 NUTCH_HOME/ivy/ivy.xml cannot resolve the dependencies. Specifically, the 
 classes are 
 {code}
 - FeedParser: <dependency org="net.java.dev.rome" name="rome" rev="1.0.0" conf="*->master"/>
 - URLAutomationFilter: <dependency org="dk.brics" name="automaton" rev="???"/>
 - SWFParser: <dependency org="com.google.gwt" name="gwt-incubator" rev="2.0.1"/>
 - HTMLParser: <dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.15"/>
 {code}
 Further to this, I cannot locate the dk.brics dependency!
 Finally, the plugin/ivy.xml files for the above plugins cannot be parsed 
 correctly due to the ${nutch.root} variable.


[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter

2011-11-24 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13156805#comment-13156805
 ] 

Lewis John McGibbney commented on NUTCH-1210:
-

Hi Markus I think this is a great idea. It bears *some* similarity to this old 
issue NUTCH-208 and I think it would be an excellent contribution.

 DomainBlacklistFilter
 -

 Key: NUTCH-1210
 URL: https://issues.apache.org/jira/browse/NUTCH-1210
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 The current DomainFilter acts as a white list. We also need a filter that 
 acts as a black list so we can allow TLDs and/or domains with DomainFilter 
 but blacklist specific subdomains. If we patched the current DomainFilter 
 for this behaviour it would break current semantics such as its precedence. 
 Therefore I would propose a new filter instead.
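
 A minimal, standalone sketch of the blacklist idea described above. This is 
 not the proposed plugin: the class name BlacklistSketch and its constructor 
 are hypothetical, and it only assumes the convention that a URL filter returns 
 the URL to keep it and null to reject it. A host is rejected when it, or any 
 of its parent domains, is listed.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the NUTCH-1210 implementation.
public class BlacklistSketch {

    private final Set<String> blocked;

    public BlacklistSketch(String... blockedDomains) {
        this.blocked = new HashSet<String>(Arrays.asList(blockedDomains));
    }

    // Returns the URL unchanged if allowed, or null if its host
    // (or any parent domain of the host) is blacklisted or unparsable.
    public String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
        if (host == null) {
            return null;
        }
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (blocked.contains(candidate)) {
                return null;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return url;
            }
            // Strip the left-most label and test the parent domain next.
            candidate = candidate.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        BlacklistSketch f = new BlacklistSketch("spam.example.org");
        System.out.println(f.filter("http://spam.example.org/page")); // prints "null"
        System.out.println(f.filter("http://www.example.org/"));      // prints the URL
    }
}
```

 Keeping this separate from the whitelist means the existing DomainFilter can 
 still admit example.org while a second pass drops spam.example.org.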


[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT

2011-11-23 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13155898#comment-13155898
 ] 

Lewis John McGibbney commented on NUTCH-1205:
-

This can't progress until we get the Gora 0.2-SNAPSHOTs loaded to 

http://repo1.maven.org/maven2/org/apache/gora/

I'll work on this 

 Upgrade gora modules to 0.2-SNAPSHOT
 

 Key: NUTCH-1205
 URL: https://issues.apache.org/jira/browse/NUTCH-1205
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora


 Although gora trunk is unstable, work is ongoing to get this fixed. For the 
 time being, I think Nutchgora should use gora trunk as this will identify 
 more vulnerabilities. I'll get the trivial patch submitted shortly.


[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files

2011-11-15 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150500#comment-13150500
 ] 

Lewis John McGibbney commented on NUTCH-1189:
-

Yes it certainly should. Couple of things to sort out just now then I'll come 
back to this, get the other properties added in a commented fashion then we 
should be good to fire this one off.
Thanks for now.

 add commented out default settings to gora.properties files 
 

 Key: NUTCH-1189
 URL: https://issues.apache.org/jira/browse/NUTCH-1189
 Project: Nutch
  Issue Type: Sub-task
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: nutchgora

 Attachments: NUTCH-1189-v2.patch, NUTCH-1189-v3.patch, 
 NUTCH-1189.patch


 This issue should have been dealt with as part of its parent issue; however, 
 as it is a fairly large task in itself, I think it needs to be done 
 independently. The gora.properties file should, amongst other settings, and 
 besides the extremely basic defaults for sqlstore, include defaults for 
 connecting to HBase, Cassandra, etc. servers on their default ports. Leaving 
 this down to individual interpretation puts a huge onus on the user, hence 
 constructing a barrier to entry for getting the configuration settings up and 
 running.


[jira] [Commented] (NUTCH-1148) Nutchgora job jar functionalilty is broken: PluginManifestParser cannot load plugins from system classloader.

2011-11-15 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150502#comment-13150502
]

Lewis John McGibbney commented on NUTCH-1148:
-

I'm happy to see this one go Ferdy. It's been sitting for a while and has
slipped through the net until now. Any other comments please?

Nutchgora job jar functionalilty is broken: PluginManifestParser cannot load
plugins from system classloader.
-

Key: NUTCH-1148
URL: https://issues.apache.org/jira/browse/NUTCH-1148
Project: Nutch
Issue Type: Bug
Affects Versions: nutchgora
Reporter: Ferdy Galema
Priority: Critical
Attachments: NUTCH-1148-v1.patch

This affects running nutchgora using Hadoop's RunJar mechanism (hadoop jar
...). The mr tasks are perfectly able to load the plugins (please note
NUTCH-937). But, when the plugins are loaded from the *job submitter* process
itself, loading plugins might fail due to classloading issues. This is caused
by the fact that PluginManifestParser does not use the contextClassLoader
that is set by RunJar. This classloader contains the plugins folder. At least
the FetcherJob is affected by this, because the job submitter uses getFields
of Protocol implementations, therefore loading the plugins.
The current 1.x is not affected because it does not load plugins at any point
during the job submission. This might of course change so I propose to 'fix'
the issue in the 1.x branch as well.
The solution is fairly simple, PluginManifestParser should use the
contextClassLoader of the current thread instead of using the system
classloader. I will attach patch right away. It currently works but it still
needs some further testing.
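
The proposed fix can be sketched as follows. This is not the attached patch:
the class and method names are hypothetical, and the fallback to the system
classloader when no context classloader is set is an added assumption. The
main method simulates a launcher (such as RunJar) installing its own loader
on the current thread.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical illustration only; not the NUTCH-1148 patch.
public class PluginClassLoaders {

    // Prefer the thread's context classloader; fall back to the system classloader.
    public static ClassLoader pluginClassLoader() {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        return (cl != null) ? cl : ClassLoader.getSystemClassLoader();
    }

    public static void main(String[] args) {
        // Simulate a launcher installing a custom loader on the current thread.
        ClassLoader custom =
            new URLClassLoader(new URL[0], ClassLoader.getSystemClassLoader());
        Thread.currentThread().setContextClassLoader(custom);

        // Resources looked up through this loader now see whatever the launcher added.
        System.out.println(pluginClassLoader() == custom); // prints "true"
    }
}
```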

[jira] [Commented] (NUTCH-1068) Automaton performance improvements based on Lucene code base

2011-11-12 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149214#comment-13149214
]

Lewis John McGibbney commented on NUTCH-1068:
-

Hi Kirby. I understand that this was a while ago now but as no-one has
commented I thought we may as well keep something moving after our conversation
on the dev list. Can you explain how you propose to integrate this into Nutch
code? I am unsure where to start as it is a github patch. It's also a huge
patch. The performance stuff you mention sounds appealing but I really don't
know enough just now, especially as I can't use this patch with trunk code.
Thank you

Automaton performance improvements based on Lucene code base

Key: NUTCH-1068
URL: https://issues.apache.org/jira/browse/NUTCH-1068
Project: Nutch
Issue Type: Improvement
Reporter: Kirby Bohling
Attachments: automaton.diff

The Lucene team maintains a modified Automaton library cut down to precisely
what they need. It can have significant performance enhancements.
I am attempting to backport and shepherd a patch for the original Automaton
library.
The original Lucene code is here:
http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
The Lucene code is likely slightly faster, as it includes several micro
optimizations I removed to avoid having to request re-license permission. I
would definitely performance test using the Lucene RegEx vs. the patched
code. The Lucene code also uses code points not characters, which might make
a difference for UTF-16 vs. UTF-32 in obscure cases (I believe the Lucene
code builds a UTF-32 clean DFA for accuracy, and then translates it to a
UTF-8 DFA for performance but I'm not 100% sure. I don't need/use any of
that code, and currently really only worried about ASCII DFAs).
When making heavy use of the NFA-to-DFA transformation, I see a 4x speed up.
It likely has a 1.5-2x speed up for regular expression execution from what I
can tell. The Nutch backend uses this code in a couple of places, and it
likely would lead to performance benefits for those areas.
I will attach my backported version for the Automaton 1.11-7 release. While
I don't own any of the copyright, all of the code is copyrighted under the
BSD license, or the ASF 2.0 license. It is pretty obviously approved for ASF
usage. I am not checking that the patch is usable as I'm not the copyright
holder. If that is an issue, I'll say yes, I just don't believe I have any
legal standing to do so. I don't want to create licensing issues for the ASF.

[jira] [Commented] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-04 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143966#comment-13143966
 ] 

Lewis John McGibbney commented on NUTCH-1070:
-

Is this issue to be closed off, Radim?
Thanks

 Run nutch under native windows (no cygwin)
 --

 Key: NUTCH-1070
 URL: https://issues.apache.org/jira/browse/NUTCH-1070
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.3
 Environment: Windows XP Home
Reporter: Radim Kolar
Priority: Minor
  Labels: windows

 It's possible to run Nutch on Windows without Cygwin. 
 1. The startup script needs to be ported from SH to BAT.
 2. Because Hadoop runs on Unix only, we must emulate Unix commands to make it 
 work. Luckily only chmod, bash and df need to be emulated.


[jira] [Commented] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-04 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144084#comment-13144084
 ] 

Lewis John McGibbney commented on NUTCH-1070:
-

Thanks for your comments Radim. 

Any objections to closing this one off?

 Run nutch under native windows (no cygwin)
 --

 Key: NUTCH-1070
 URL: https://issues.apache.org/jira/browse/NUTCH-1070
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.3
 Environment: Windows XP Home
Reporter: Radim Kolar
Priority: Minor
  Labels: windows

 It's possible to run Nutch on Windows without Cygwin. 
 1. The startup script needs to be ported from SH to BAT.
 2. Because Hadoop runs on Unix only, we must emulate Unix commands to make it 
 work. Luckily only chmod, bash and df need to be emulated.


[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files

2011-11-04 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144199#comment-13144199
 ] 

Lewis John McGibbney commented on NUTCH-1189:
-

In addition, there is scope to provide a much richer info resource within this 
file but I will get round to that later.

 add commented out default settings to gora.properties files 
 

 Key: NUTCH-1189
 URL: https://issues.apache.org/jira/browse/NUTCH-1189
 Project: Nutch
  Issue Type: Sub-task
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: nutchgora

 Attachments: NUTCH-1189-v2.patch, NUTCH-1189.patch


 This issue should have been dealt with as part of its parent issue; however, 
 as it is a fairly large task in itself, I think it needs to be done 
 independently. The gora.properties file should, amongst other settings, and 
 besides the extremely basic defaults for sqlstore, include defaults for 
 connecting to HBase, Cassandra, etc. servers on their default ports. Leaving 
 this down to individual interpretation puts a huge onus on the user, hence 
 constructing a barrier to entry for getting the configuration settings up and 
 running.


[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]

2011-11-02 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142048#comment-13142048
 ] 

Lewis John McGibbney commented on NUTCH-1188:
-

Hi guys, I think it's critical that we get this one ironed out before we begin 
firing RCs. Can we confirm that in our trunk 1.4 development code (and in the 
Nutchgora branch) this has been sorted out previously and that it is only 
an issue in the now deprecated 1.4 branch? Thanks 

 ERROR util.LogUtil - Cannot log with method [null]
 --

 Key: NUTCH-1188
 URL: https://issues.apache.org/jira/browse/NUTCH-1188
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4
 Environment: no special environment
Reporter: Zhang JinYan
 Attachments: LogUtil.patch


 LogUtil has static fields, which are initialized like this:
 FATAL = Logger.class.getMethod("error", new Class[] { Object.class });
 but the Logger has no such method; the correct method is:
 void org.slf4j.Logger.error(String msg)
 So LogUtil's static fields are not initialized correctly (they are null).
 ---
 Run a crawl and you will find this message in hadoop.log:
 2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null]
 java.lang.NullPointerException
   at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
   at java.io.PrintStream.write(PrintStream.java:432)
   at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
   at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
   at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
   at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
   at java.io.PrintStream.newLine(PrintStream.java:496)
   at java.io.PrintStream.println(PrintStream.java:757)
   at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492)
   at java.lang.Throwable.printStackTrace(Throwable.java:468)
   at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
 
 Patch:
 FATAL = Logger.class.getMethod("error", new Class[] { String.class });
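
 The failure mode can be reproduced in isolation. The sketch below is 
 illustrative only: MiniLogger is a hypothetical stand-in for org.slf4j.Logger, 
 reduced to the single method discussed above, so the example runs without 
 slf4j on the classpath. Class.getMethod matches parameter types exactly, so 
 asking for error(Object) fails even though error(String) exists.

```java
import java.lang.reflect.Method;

// Hypothetical illustration only; not the LogUtil source.
public class LogMethodLookup {

    // Stand-in for org.slf4j.Logger with only the method under discussion.
    interface MiniLogger {
        void error(String msg);
    }

    // Returns the matching public method, or null when no method has
    // exactly this name and parameter type.
    public static Method lookup(String name, Class<?> paramType) {
        try {
            return MiniLogger.class.getMethod(name, paramType);
        } catch (NoSuchMethodException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Object.class does not match error(String): the lookup fails.
        System.out.println(lookup("error", Object.class));         // prints "null"
        // String.class matches exactly: the lookup succeeds.
        System.out.println(lookup("error", String.class) != null); // prints "true"
    }
}
```

 If the lookup failure is swallowed and the field left null, any later call 
 through that field throws the NullPointerException seen in the trace.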


[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files

2011-11-02 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142453#comment-13142453
 ] 

Lewis John McGibbney commented on NUTCH-1189:
-

Hi Ferdy, does this have any knock-on effect on what we would wish to include 
within gora.properties? I understand that you can manually add properties to 
your HBASEHOME/conf/hbase-site.xml; however, if you think any additional 
properties would add value to this patch please re-submit the patch. Your usage 
of HBase far exceeds my use case so please feel free.

 add commented out default settings to gora.properties files 
 

 Key: NUTCH-1189
 URL: https://issues.apache.org/jira/browse/NUTCH-1189
 Project: Nutch
  Issue Type: Sub-task
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: nutchgora

 Attachments: NUTCH-1189.patch


 This issue should have been dealt with as part of its parent issue; however, 
 as it is a fairly large task in itself, I think it needs to be done 
 independently. The gora.properties file should, amongst other settings, and 
 besides the extremely basic defaults for sqlstore, include defaults for 
 connecting to HBase, Cassandra, etc. servers on their default ports. Leaving 
 this down to individual interpretation puts a huge onus on the user, hence 
 constructing a barrier to entry for getting the configuration settings up and 
 running.


[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files

2011-11-01 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141238#comment-13141238
 ] 

Lewis John McGibbney commented on NUTCH-1189:
-

Ferdy, would it be possible for you to attach a patch for HBase (if required)? 
I will work on the Cassandra stuff, then hopefully we can knock our heads 
together with some others to get the remaining back ends included within the 
gora.properties file.

 add commented out default settings to gora.properties files 
 

 Key: NUTCH-1189
 URL: https://issues.apache.org/jira/browse/NUTCH-1189
 Project: Nutch
  Issue Type: Sub-task
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: nutchgora


 This issue should have been dealt with as part of its parent issue; however, 
 as it is a fairly large task in itself, I think it needs to be done 
 independently. The gora.properties file should, amongst other settings and 
 besides the extremely basic defaults for sqlstore, include defaults for 
 connecting to HBase, Cassandra, etc. servers on their default ports. Leaving 
 this down to individual interpretation puts a huge onus on the user, hence 
 constructing a barrier to entry for getting the configuration settings up and 
 running.


[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]

2011-11-01 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141240#comment-13141240
 ] 

Lewis John McGibbney commented on NUTCH-1188:
-

Thank you for this patch. In the short term, when we get one other +1, I would 
like to commit. Can I ask you to have a look @ NUTCH-1138 and comment on 
whether the patch is any use for your activities. It is our vision to remove 
LogUtil and use the Slf4j/Log4j framework for all logging.
Thank you very much for this patch.

 ERROR util.LogUtil - Cannot log with method [null]
 --

 Key: NUTCH-1188
 URL: https://issues.apache.org/jira/browse/NUTCH-1188
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4
 Environment: no special environment
Reporter: Zhang JinYan
 Attachments: LogUtil.patch


 LogUtil has static fields, which are initialized like this:
 FATAL = Logger.class.getMethod("error", new Class[] { Object.class });
 but the Logger has no such method; the correct method is:
 void org.slf4j.Logger.error(String msg)
 So LogUtil's static fields are not initialized correctly (they are null).
 ---
 Run a crawl and you will find this message in hadoop.log:
 2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null]
 java.lang.NullPointerException
   at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
   at java.io.PrintStream.write(PrintStream.java:432)
   at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
   at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
   at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
   at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
   at java.io.PrintStream.newLine(PrintStream.java:496)
   at java.io.PrintStream.println(PrintStream.java:757)
   at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492)
   at java.lang.Throwable.printStackTrace(Throwable.java:468)
   at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
 
 Patch:
 FATAL = Logger.class.getMethod("error", new Class[] { String.class });
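
 The lookup failure described above can be reproduced without Nutch or SLF4J. 
 The sketch below is only an illustration: DemoLogger is a hypothetical 
 stand-in interface that offers error(String) and no error(Object), and the 
 helper assumes, as the description implies, that the failed lookup is caught 
 and the field is left null. Class.getMethod matches parameter types exactly, 
 so asking for Object.class does not find a method declared with String.

```java
import java.lang.reflect.Method;

public class GetMethodLookup {

    /** Hypothetical stand-in for a logger that only declares error(String). */
    public interface DemoLogger {
        void error(String msg);
    }

    /**
     * Returns the public method with exactly this single parameter type,
     * or null when no such method exists.
     */
    public static Method lookup(String name, Class<?> paramType) {
        try {
            return DemoLogger.class.getMethod(name, paramType);
        } catch (NoSuchMethodException e) {
            // Parameter types must match exactly; Object.class does not
            // match a method declared as error(String).
            return null;
        }
    }

    public static void main(String[] args) {
        // Mirrors the failing lookup: no error(Object) is declared.
        System.out.println("error(Object): " + lookup("error", Object.class));
        // Mirrors the patched lookup: error(String) is found.
        System.out.println("error(String): " + lookup("error", String.class));
    }
}
```

 Invoking a Method reference that was left null in this way would fail with a 
 NullPointerException, which is consistent with the stack trace above.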


[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]

2011-11-01 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141300#comment-13141300
 ] 

Lewis John McGibbney commented on NUTCH-1188:
-

Is it just me, or has this already been committed along with NUTCH-1078 in 
trunk [1], when Julien fixed it in the Nutchgora branch [2]?

[1] 
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/util/LogUtil.java?r1=1175075&r2=1177290&diff_format=h
[2] 
http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/util/LogUtil.java?r1=983885&r2=988544&diff_format=h

 ERROR util.LogUtil - Cannot log with method [null]
 --

 Key: NUTCH-1188
 URL: https://issues.apache.org/jira/browse/NUTCH-1188
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4
 Environment: no special environment
Reporter: Zhang JinYan
 Attachments: LogUtil.patch


 LogUtil has static fields, which are initialized like this:
 FATAL = Logger.class.getMethod("error", new Class[] { Object.class });
 but the Logger has no such method; the correct method is:
 void org.slf4j.Logger.error(String msg)
 So LogUtil's static fields are not initialized correctly (they are null).
 ---
 Run a crawl and you will find this message in hadoop.log:
 2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null]
 java.lang.NullPointerException
   at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103)
   at java.io.PrintStream.write(PrintStream.java:432)
   at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
   at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
   at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
   at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
   at java.io.PrintStream.newLine(PrintStream.java:496)
   at java.io.PrintStream.println(PrintStream.java:757)
   at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492)
   at java.lang.Throwable.printStackTrace(Throwable.java:468)
   at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197)
   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)
 
 Patch:
 FATAL = Logger.class.getMethod("error", new Class[] { String.class });


[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2011-11-01 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141376#comment-13141376
 ] 

Lewis John McGibbney commented on NUTCH-1138:
-

Hi. Current 1.4 development takes place in trunk in SVN. Is this possibly 
where the confusion is stemming from?
When we make code commits, we are committing to the trunk 1.4 development, 
rather than the branch-1.4 development. The reasoning behind this can be seen 
in the latest announcement on the Nutch home page.

 remove LogUtil from trunk and nutch gora
 

 Key: NUTCH-1138
 URL: https://issues.apache.org/jira/browse/NUTCH-1138
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5

 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch


 This should move towards the removal of the LogUtil class from both codebases 
 as per comments in NUTCH-1078.


[jira] [Commented] (NUTCH-1156) building errors with gora-hbase as a backend; update ivy.xml to use correct dependancies

2011-10-31 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140068#comment-13140068
]

Lewis John McGibbney commented on NUTCH-1156:
-

I forgot that we had reopened this one, yes +1 for committing the thrift
exclusion.

building errors with gora-hbase as a backend; update ivy.xml to use correct
dependancies

Key: NUTCH-1156
URL: https://issues.apache.org/jira/browse/NUTCH-1156
Project: Nutch
Issue Type: Bug
Components: build
Affects Versions: nutchgora
Reporter: Ferdy
Fix For: nutchgora

Attachments: NUTCH-1156-v1.patch, NUTCH-1156-v2.patch,
NUTCH-1156-v3.patch, NUTCH-1156-v4.patch

This patch makes sure nutchgora can actually be built when gora-hbase is
uncommented in ivy.xml. Note that it is still commented out though, so sql is still
the default backend. However whenever one wishes to use hbase (as we do)
simply uncommenting the section in ivy.xml won't do the trick. This patch
fixes this. Changes in ivy.xml:
-Set correct version for gora-hbase and config.
-Add thrift to exclude as it is not available in the repos; it is not needed
in most cases but please correct me if I'm wrong.
-Additionally, it removes the comment that hbase library itself should be
manually added, as this is not needed anymore.

[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2011-10-31 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140669#comment-13140669
 ] 

Lewis John McGibbney commented on NUTCH-1138:
-

OK, so this patch for trunk seems to pass all my tests so far. Could I ask for 
someone to provisionally apply it and test for a day or so, as I expect some 
errors to slip through the net somewhere down the line.

 remove LogUtil from trunk and nutch gora
 

 Key: NUTCH-1138
 URL: https://issues.apache.org/jira/browse/NUTCH-1138
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5

 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch


 This should move towards the removal of the LogUtil class from both codebases 
 as per comments in NUTCH-1078.


[jira] [Commented] (NUTCH-842) AutoGenerate WebPage code

2011-10-26 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135899#comment-13135899
 ] 

Lewis John McGibbney commented on NUTCH-842:


It has been a good while since this issue was last in view. Does anyone have an 
opinion on where we are with this one? The patch doesn't incorporate the later 
comments above; is this something which would be required?

 AutoGenerate WebPage code
 -

 Key: NUTCH-842
 URL: https://issues.apache.org/jira/browse/NUTCH-842
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: nutchgora

 Attachments: NUTCH-842.patch


 This issue will track the addition of an ant task that will automatically 
 generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from 
 src/gora/webpage.avsc.

