[jira] Created: (NUTCH-706) Url regex normalizer
Url regex normalizer

Key: NUTCH-706
URL: https://issues.apache.org/jira/browse/NUTCH-706
Project: Nutch
Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Meghna Kukreja
Priority: Minor
Fix For: 1.0.0

Hey,

I encountered the following problem while trying to crawl a site using nutch-trunk. In the file regex-normalize.xml, the following regex is used to remove session ids:

    <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>

This pattern also transforms a URL such as newsId=2000484784794&newsLang=en into new&newsLang=en, because the case-insensitive 'sid' matches the 'sId' inside 'newsId'. The resulting URL is incorrect, so the page does not get fetched. The expression needs to be changed to prevent this.

Thanks,
Meghna

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
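The behaviour described in the report can be reproduced outside Nutch with plain java.util.regex. The following is a minimal sketch, not Nutch code: the class name SessionIdPatternDemo and the example.com URL are made up for illustration, the XML entity in the pattern is assumed to decode to a literal '&', and the substitution "$4" (keep only the terminating character) is an assumption about the rule rather than something stated in the report.

```java
import java.util.regex.Pattern;

class SessionIdPatternDemo {

    // The pattern quoted in the report, with the XML entity decoded to '&'.
    static final Pattern SESSION_ID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumption: the rule replaces each match with group 4, the terminator.
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // The URL fragment from the report: 'sId' inside 'newsId' is matched.
        System.out.println(normalize("newsId=2000484784794&newsLang=en"));
        // prints: new&newsLang=en

        // A genuine session id (hypothetical URL) is removed as intended.
        System.out.println(normalize("http://example.com/a;jsessionid=ABC123?x=1"));
        // prints: http://example.com/a?x=1
    }
}
```

Because the leading delimiter `[;_]?` is optional, nothing forces the match to start at a parameter boundary, which is why 'newsId' is affected.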
[jira] Commented: (NUTCH-706) Url regex normalizer
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677460#action_12677460 ]

Meghna Kukreja commented on NUTCH-706:

The pattern should be changed to:

    <pattern>([;_\?&amp;]((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>

Can someone please verify this? I am not very good with regular expressions :)
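One way to check the proposed pattern is to run it against a few URLs with java.util.regex. This is only a sketch under assumptions: the class name ProposedPatternCheck and the example.com URLs are hypothetical, the character class is assumed to decode to `[;_\?&]`, and the substitution is assumed to be "$4".

```java
import java.util.regex.Pattern;

class ProposedPatternCheck {

    // The proposed pattern, with the XML entity decoded to '&'.
    // The leading delimiter is now mandatory: one of ';', '_', '?' or '&'.
    static final Pattern PROPOSED = Pattern.compile(
        "([;_\\?&]((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumption: the rule replaces each match with group 4, the terminator.
    static String normalize(String url) {
        return PROPOSED.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // The URL fragment from the report is no longer touched.
        System.out.println(normalize("newsId=2000484784794&newsLang=en"));
        // prints: newsId=2000484784794&newsLang=en

        // A session id introduced by ';' is still removed.
        System.out.println(normalize("http://example.com/a;jsessionid=ABC123?x=1"));
        // prints: http://example.com/a?x=1

        // The leading '?' is part of the match, so it is removed as well.
        System.out.println(normalize("http://example.com/page?sid=abc123&x=1"));
        // prints: http://example.com/page&x=1
    }
}
```

Under these assumptions the 'newsId' case is fixed. The last example suggests a side effect worth reviewing: when the session id is the first query parameter, the '?' that starts the query string is consumed together with it.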
[jira] Commented: (NUTCH-374) when http.content.limit be set to -1 and Response.CONTENT_ENCODING is gzip or x-gzip , it can not fetch any thing.
[ http://issues.apache.org/jira/browse/NUTCH-374?page=comments#action_12438722 ]

Meghna Kukreja commented on NUTCH-374:

I have experienced this same problem, and I fixed it by changing the function unzipBestEffort() in GZIPUtils.java. I changed this if statement:

    if ((written + size) > sizeLimit) {
      outStream.write(buf, 0, sizeLimit - written);
      break;
    }

to:

    if ((written + size) > sizeLimit && sizeLimit >= 0) {
      outStream.write(buf, 0, sizeLimit - written);
      break;
    }

> when http.content.limit be set to -1 and Response.CONTENT_ENCODING is gzip or x-gzip , it can not fetch any thing.
>
> Key: NUTCH-374
> URL: http://issues.apache.org/jira/browse/NUTCH-374
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8, 0.8.1
> Reporter: King Kong
>
> I set http.content.limit to -1 so that fetched content is not truncated. However, if the response used gzip or x-gzip encoding, the content could not be uncompressed.
> I found that the problem is in HttpBase.processGzipEncoded (plugin lib-http):
>
>     ...
>     byte[] content = GZIPUtils.unzipBestEffort(compressed, getMaxContent());
>     ...
>
> This call does not treat -1 as "no limit", so the code must be modified to handle it:
>
>     byte[] content;
>     if (getMaxContent() >= 0) {
>       content = GZIPUtils.unzipBestEffort(compressed, getMaxContent());
>     } else {
>       content = GZIPUtils.unzipBestEffort(compressed);
>     }

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
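The guard discussed in the comment can be illustrated with a small standalone sketch. This is not the Nutch source: the class GzipLimitSketch, the gzip() helper, and the buffer size are made up for illustration, and only the size check mirrors the quoted if statement. Without the `sizeLimit >= 0` condition, a limit of -1 would make `sizeLimit - written` negative, and a negative length is rejected by the output stream's write call, which would explain why no content comes back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipLimitSketch {

    // Arbitrary buffer size chosen for this sketch.
    private static final int BUF_SIZE = 4096;

    // Decompresses as much as possible; a negative sizeLimit means "no limit".
    static byte[] unzipBestEffort(byte[] in, int sizeLimit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(in))) {
            byte[] buf = new byte[BUF_SIZE];
            int written = 0;
            while (true) {
                int size = gz.read(buf);
                if (size <= 0) {
                    break;
                }
                // Truncate only when a non-negative limit is configured.
                if (sizeLimit >= 0 && (written + size) > sizeLimit) {
                    out.write(buf, 0, sizeLimit - written);
                    break;
                }
                out.write(buf, 0, size);
                written += size;
            }
        } catch (IOException e) {
            // Best effort: return whatever was decoded before the error.
        }
        return out.toByteArray();
    }

    // Helper for the demo only: gzip-compresses a byte array.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "hello, nutch".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip(original);
        System.out.println(unzipBestEffort(compressed, -1).length); // prints: 12
        System.out.println(unzipBestEffort(compressed, 5).length);  // prints: 5
    }
}
```

Either fix from the thread leads to the same result for a limit of -1: the guard inside unzipBestEffort() handles it in one place, while the check in HttpBase.processGzipEncoded avoids passing a negative limit at all.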