[jira] [Updated] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1269: Fix Version/s: 1.7 Generate main problems -- Key: NUTCH-1269 URL: https://issues.apache.org/jira/browse/NUTCH-1269 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht Labels: Generate, MaxHostCount, MaxNumSegments Fix For: 1.7 Attachments: NUTCH-1269.patch, NUTCH-1269-v.2.patch there are some problems with current Generate method, with maxNumSegments and maxHostCount options: 1. first, size of generated segments are different 2. with maxHostCount option, it is unclear that it was applied or not 3. urls from one host are distributed non-uniform between segments we change Generator.java as described below: in Selector class: private int maxNumSegments; private int segmentSize; private int maxHostCount; public void config ... maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1); segmentSize=(int)job.getInt(GENERATOR_TOP_N, 1000)/maxNumSegments; maxHostCount=job.getInt(GENERATE_MAX_PER_HOST, 100); ... public void reduce(FloatWritable key, IteratorSelectorEntry values, OutputCollectorFloatWritable,SelectorEntry output, Reporter reporter) throws IOException { int limit2=(int)((limit*3)/2); while (values.hasNext()) { if(count == limit) break; if (count % segmentSize == 0 ) { if (currentsegmentnum maxNumSegments-1){ currentsegmentnum++; } else currentsegmentnum=0; } boolean full=true; for(int jk=0;jkmaxNumSegments;jk++){ if (segCounts[jk]segmentSize){ full=false; } } if(full){ break; } SelectorEntry entry = values.next(); Text url = entry.url; //logWrite(Generated3:+limit+-+count+-+url.toString()); String urlString = url.toString(); URL u = null; String hostordomain = null; try { if (normalise normalizers != null) { urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_GENERATE_HOST_COUNT); } u = new URL(urlString); if (byDomain) { hostordomain = URLUtil.getDomainName(u); } else { hostordomain = new URL(urlString).getHost(); } hostordomain = hostordomain.toLowerCase(); boolean countLimit=true; // only filter if we are counting hosts or domains int[] hostCount = hostCounts.get(hostordomain); //host count: {a,b,c,d} means that from this host there are a urls in segment 0 and b urls in seg 1 and ... if (hostCount == null) { hostCount = new int[maxNumSegments]; for(int kl=0;klhostCount.length;kl++) hostCount[kl]=0; hostCounts.put(hostordomain, hostCount); } int selectedSeg=currentsegmentnum; int minCount=hostCount[selectedSeg]; for(int jk=0;jkmaxNumSegments;jk++){ if(hostCount[jk]minCount){ minCount=hostCount[jk]; selectedSeg=jk; } } if(hostCount[selectedSeg]=maxHostCount){ count++; entry.segnum = new IntWritable(selectedSeg); hostCount[selectedSeg]++; output.collect(key, entry); } } catch (Exception e) { LOG.warn(Malformed URL: ' + urlString + ', skipping ( logWrite(Generate-malform:+hostordomain+-+url.toString()); + StringUtils.stringifyException(e) + )); //continue; } } } -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] behnam nikbakht updated NUTCH-1269: --- i rebuild a patch with current trank and upload it to solve these two problems: 1. when generate run in hadoop, multiple reduce tasks use common variables: count, currentsegmentnum, hostCounts and there are possibility of mistake in producing equal size segments 2. for efficient fetch, required that size of all segments be equal and distribution of hosts between segments be uniform. I Suggest that use a common atomic variables in Generator class for problem 1 and use this semantic for hostCounts: if there are 3 segments, for site a, if hostCounts[a] = {b,c,d}, this means that there are b url in segment 1 and c url in segment 2 and d url in segment 3 from host a Generate main problems -- Key: NUTCH-1269 URL: https://issues.apache.org/jira/browse/NUTCH-1269 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht Labels: Generate, MaxHostCount, MaxNumSegments Attachments: NUTCH-1269.patch there are some problems with current Generate method, with maxNumSegments and maxHostCount options: 1. first, size of generated segments are different 2. with maxHostCount option, it is unclear that it was applied or not 3. urls from one host are distributed non-uniform between segments we change Generator.java as described below: in Selector class: private int maxNumSegments; private int segmentSize; private int maxHostCount; public void config ... maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1); segmentSize=(int)job.getInt(GENERATOR_TOP_N, 1000)/maxNumSegments; maxHostCount=job.getInt(GENERATE_MAX_PER_HOST, 100); ... public void reduce(FloatWritable key, IteratorSelectorEntry values, OutputCollectorFloatWritable,SelectorEntry output, Reporter reporter) throws IOException { int limit2=(int)((limit*3)/2); while (values.hasNext()) { if(count == limit) break; if (count % segmentSize == 0 ) { if (currentsegmentnum maxNumSegments-1){ currentsegmentnum++; } else currentsegmentnum=0; } boolean full=true; for(int jk=0;jkmaxNumSegments;jk++){ if (segCounts[jk]segmentSize){ full=false; } } if(full){ break; } SelectorEntry entry = values.next(); Text url = entry.url; //logWrite(Generated3:+limit+-+count+-+url.toString()); String urlString = url.toString(); URL u = null; String hostordomain = null; try { if (normalise normalizers != null) { urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_GENERATE_HOST_COUNT); } u = new URL(urlString); if (byDomain) { hostordomain = URLUtil.getDomainName(u); } else { hostordomain = new URL(urlString).getHost(); } hostordomain = hostordomain.toLowerCase(); boolean countLimit=true; // only filter if we are counting hosts or domains int[] hostCount = hostCounts.get(hostordomain); //host count: {a,b,c,d} means that from this host there are a urls in segment 0 and b urls in seg 1 and ... if (hostCount == null) { hostCount = new int[maxNumSegments]; for(int kl=0;klhostCount.length;kl++) hostCount[kl]=0; hostCounts.put(hostordomain, hostCount); } int selectedSeg=currentsegmentnum; int minCount=hostCount[selectedSeg]; for(int jk=0;jkmaxNumSegments;jk++){ if(hostCount[jk]minCount){ minCount=hostCount[jk]; selectedSeg=jk; } } if(hostCount[selectedSeg]=maxHostCount){ count++; entry.segnum = new IntWritable(selectedSeg); hostCount[selectedSeg]++; output.collect(key, entry); } } catch (Exception e) { LOG.warn(Malformed URL: ' + urlString + ', skipping ( logWrite(Generate-malform:+hostordomain+-+url.toString()); + StringUtils.stringifyException(e) + )); //continue; } } } -- This
[jira] [Updated] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] behnam nikbakht updated NUTCH-1269: --- Attachment: NUTCH-1269-v.2.patch Generate main problems -- Key: NUTCH-1269 URL: https://issues.apache.org/jira/browse/NUTCH-1269 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht Labels: Generate, MaxHostCount, MaxNumSegments Attachments: NUTCH-1269-v.2.patch, NUTCH-1269.patch there are some problems with current Generate method, with maxNumSegments and maxHostCount options: 1. first, size of generated segments are different 2. with maxHostCount option, it is unclear that it was applied or not 3. urls from one host are distributed non-uniform between segments we change Generator.java as described below: in Selector class: private int maxNumSegments; private int segmentSize; private int maxHostCount; public void config ... maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1); segmentSize=(int)job.getInt(GENERATOR_TOP_N, 1000)/maxNumSegments; maxHostCount=job.getInt(GENERATE_MAX_PER_HOST, 100); ... public void reduce(FloatWritable key, IteratorSelectorEntry values, OutputCollectorFloatWritable,SelectorEntry output, Reporter reporter) throws IOException { int limit2=(int)((limit*3)/2); while (values.hasNext()) { if(count == limit) break; if (count % segmentSize == 0 ) { if (currentsegmentnum maxNumSegments-1){ currentsegmentnum++; } else currentsegmentnum=0; } boolean full=true; for(int jk=0;jkmaxNumSegments;jk++){ if (segCounts[jk]segmentSize){ full=false; } } if(full){ break; } SelectorEntry entry = values.next(); Text url = entry.url; //logWrite(Generated3:+limit+-+count+-+url.toString()); String urlString = url.toString(); URL u = null; String hostordomain = null; try { if (normalise normalizers != null) { urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_GENERATE_HOST_COUNT); } u = new URL(urlString); if (byDomain) { hostordomain = URLUtil.getDomainName(u); } else { hostordomain = new URL(urlString).getHost(); } hostordomain = hostordomain.toLowerCase(); boolean countLimit=true; // only filter if we are counting hosts or domains int[] hostCount = hostCounts.get(hostordomain); //host count: {a,b,c,d} means that from this host there are a urls in segment 0 and b urls in seg 1 and ... if (hostCount == null) { hostCount = new int[maxNumSegments]; for(int kl=0;klhostCount.length;kl++) hostCount[kl]=0; hostCounts.put(hostordomain, hostCount); } int selectedSeg=currentsegmentnum; int minCount=hostCount[selectedSeg]; for(int jk=0;jkmaxNumSegments;jk++){ if(hostCount[jk]minCount){ minCount=hostCount[jk]; selectedSeg=jk; } } if(hostCount[selectedSeg]=maxHostCount){ count++; entry.segnum = new IntWritable(selectedSeg); hostCount[selectedSeg]++; output.collect(key, entry); } } catch (Exception e) { LOG.warn(Malformed URL: ' + urlString + ', skipping ( logWrite(Generate-malform:+hostordomain+-+url.toString()); + StringUtils.stringifyException(e) + )); //continue; } } } -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] behnam nikbakht updated NUTCH-1269: --- Attachment: NUTCH-1269.patch Generate main problems -- Key: NUTCH-1269 URL: https://issues.apache.org/jira/browse/NUTCH-1269 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht Labels: Generate, MaxHostCount, MaxNumSegments Attachments: NUTCH-1269.patch there are some problems with current Generate method, with maxNumSegments and maxHostCount options: 1. first, size of generated segments are different 2. with maxHostCount option, it is unclear that it was applied or not 3. urls from one host are distributed non-uniform between segments we change Generator.java as described below: in Selector class: private int maxNumSegments; private int segmentSize; private int maxHostCount; public void config ... maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1); segmentSize=(int)job.getInt(GENERATOR_TOP_N, 1000)/maxNumSegments; maxHostCount=job.getInt(GENERATE_MAX_PER_HOST, 100); ... public void reduce(FloatWritable key, IteratorSelectorEntry values, OutputCollectorFloatWritable,SelectorEntry output, Reporter reporter) throws IOException { int limit2=(int)((limit*3)/2); while (values.hasNext()) { if(count == limit) break; if (count % segmentSize == 0 ) { if (currentsegmentnum maxNumSegments-1){ currentsegmentnum++; } else currentsegmentnum=0; } boolean full=true; for(int jk=0;jkmaxNumSegments;jk++){ if (segCounts[jk]segmentSize){ full=false; } } if(full){ break; } SelectorEntry entry = values.next(); Text url = entry.url; //logWrite(Generated3:+limit+-+count+-+url.toString()); String urlString = url.toString(); URL u = null; String hostordomain = null; try { if (normalise normalizers != null) { urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_GENERATE_HOST_COUNT); } u = new URL(urlString); if (byDomain) { hostordomain = URLUtil.getDomainName(u); } else { hostordomain = new URL(urlString).getHost(); } hostordomain = hostordomain.toLowerCase(); boolean countLimit=true; // only filter if we are counting hosts or domains int[] hostCount = hostCounts.get(hostordomain); //host count: {a,b,c,d} means that from this host there are a urls in segment 0 and b urls in seg 1 and ... if (hostCount == null) { hostCount = new int[maxNumSegments]; for(int kl=0;klhostCount.length;kl++) hostCount[kl]=0; hostCounts.put(hostordomain, hostCount); } int selectedSeg=currentsegmentnum; int minCount=hostCount[selectedSeg]; for(int jk=0;jkmaxNumSegments;jk++){ if(hostCount[jk]minCount){ minCount=hostCount[jk]; selectedSeg=jk; } } if(hostCount[selectedSeg]=maxHostCount){ count++; entry.segnum = new IntWritable(selectedSeg); hostCount[selectedSeg]++; output.collect(key, entry); } } catch (Exception e) { LOG.warn(Malformed URL: ' + urlString + ', skipping ( logWrite(Generate-malform:+hostordomain+-+url.toString()); + StringUtils.stringifyException(e) + )); //continue; } } } -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] behnam nikbakht updated NUTCH-1269: --- Patch Info: Patch Available yes, thanks for your attention Generate main problems -- Key: NUTCH-1269 URL: https://issues.apache.org/jira/browse/NUTCH-1269 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht Labels: Generate, MaxHostCount, MaxNumSegments Attachments: NUTCH-1269.patch there are some problems with current Generate method, with maxNumSegments and maxHostCount options: 1. first, size of generated segments are different 2. with maxHostCount option, it is unclear that it was applied or not 3. urls from one host are distributed non-uniform between segments we change Generator.java as described below: in Selector class: private int maxNumSegments; private int segmentSize; private int maxHostCount; public void config ... maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1); segmentSize=(int)job.getInt(GENERATOR_TOP_N, 1000)/maxNumSegments; maxHostCount=job.getInt(GENERATE_MAX_PER_HOST, 100); ... public void reduce(FloatWritable key, IteratorSelectorEntry values, OutputCollectorFloatWritable,SelectorEntry output, Reporter reporter) throws IOException { int limit2=(int)((limit*3)/2); while (values.hasNext()) { if(count == limit) break; if (count % segmentSize == 0 ) { if (currentsegmentnum maxNumSegments-1){ currentsegmentnum++; } else currentsegmentnum=0; } boolean full=true; for(int jk=0;jkmaxNumSegments;jk++){ if (segCounts[jk]segmentSize){ full=false; } } if(full){ break; } SelectorEntry entry = values.next(); Text url = entry.url; //logWrite(Generated3:+limit+-+count+-+url.toString()); String urlString = url.toString(); URL u = null; String hostordomain = null; try { if (normalise normalizers != null) { urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_GENERATE_HOST_COUNT); } u = new URL(urlString); if (byDomain) { hostordomain = URLUtil.getDomainName(u); } else { hostordomain = new URL(urlString).getHost(); } hostordomain = hostordomain.toLowerCase(); boolean countLimit=true; // only filter if we are counting hosts or domains int[] hostCount = hostCounts.get(hostordomain); //host count: {a,b,c,d} means that from this host there are a urls in segment 0 and b urls in seg 1 and ... if (hostCount == null) { hostCount = new int[maxNumSegments]; for(int kl=0;klhostCount.length;kl++) hostCount[kl]=0; hostCounts.put(hostordomain, hostCount); } int selectedSeg=currentsegmentnum; int minCount=hostCount[selectedSeg]; for(int jk=0;jkmaxNumSegments;jk++){ if(hostCount[jk]minCount){ minCount=hostCount[jk]; selectedSeg=jk; } } if(hostCount[selectedSeg]=maxHostCount){ count++; entry.segnum = new IntWritable(selectedSeg); hostCount[selectedSeg]++; output.collect(key, entry); } } catch (Exception e) { LOG.warn(Malformed URL: ' + urlString + ', skipping ( logWrite(Generate-malform:+hostordomain+-+url.toString()); + StringUtils.stringifyException(e) + )); //continue; } } } -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira