[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1190:
-----------------------------------
    Component/s: plugin
    Fix Version/s: 1.18

> MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1190
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1190
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.4
>         Environment: jdk6
>            Reporter: Zhang JinYan
>            Priority: Major
>             Fix For: 1.18
>
>         Attachments: MoreIndexingFilter.patch, NUTCH-1190-trunk.patch, date-styles.txt
>
>
> There are many issues about missing date formats:
> [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
> [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
> [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
> The date formats can be diverse, so why not move them to an extra config file?
> I moved all the date formats from "MoreIndexingFilter.java" to a file named
> "date-styles.txt" (placed in "conf"), which is loaded on startup.
> {code}
> public void setConf(Configuration conf) {
>   this.conf = conf;
>   MIME = new MimeUtil(conf);
>
>   URL res = conf.getResource("date-styles.txt");
>   if(res==null){
>     LOG.error("Can't find resource: date-styles.txt");
>   }else{
>     try {
>       List lines = FileUtils.readLines(new File(res.getFile()));
>       for (int i = 0; i < lines.size(); i++) {
>         String dateStyle = (String) lines.get(i);
>         // skip blank lines
>         if(StringUtils.isBlank(dateStyle)){
>           lines.remove(i);
>           i--;
>           continue;
>         }
>         dateStyle=StringUtils.trim(dateStyle);
>         // skip comment lines
>         if(dateStyle.startsWith("#")){
>           lines.remove(i);
>           i--;
>           continue;
>         }
>         lines.set(i, dateStyle);
>       }
>       dateStyles = new String[lines.size()];
>       lines.toArray(dateStyles);
>     } catch (IOException e) {
>       LOG.error("Failed to load resource: date-styles.txt");
>     }
>   }
> }
> {code}
> Then parse "lastModified" like this (sample):
> {code}
> private long getTime(String date, String url) {
>   ..
>   Date parsedDate = DateUtils.parseDate(date, dateStyles);
>   time = parsedDate.getTime();
>   ..
>   return time;
> }
> {code}
> This patch also includes the patch for
> [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
> Find more details in the patch file.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
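The load-then-parse approach described in the issue above can be sketched with the JDK alone. This is only an illustration under stated assumptions, not the code from the attached patch: the class name `DateStyles`, the methods `load` and `parse`, and the sample patterns are hypothetical. It swaps commons-lang `DateUtils.parseDate` for `SimpleDateFormat` so it runs without extra dependencies, and it reads from a `Reader` rather than a `File`, which should also work when the resource is packaged inside a jar.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Nutch or of the attached patch.
final class DateStyles {

  /** Reads one pattern per line, skipping blank lines and '#' comment lines. */
  static List<String> load(Reader in) {
    List<String> patterns = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String pattern = line.trim();
        if (pattern.isEmpty() || pattern.startsWith("#")) {
          continue;
        }
        patterns.add(pattern);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return patterns;
  }

  /** Tries each pattern in order; returns epoch millis, or -1 if none matches. */
  static long parse(String date, List<String> patterns) {
    for (String pattern : patterns) {
      SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
      format.setLenient(false);
      // Assumption: dates without an explicit zone are interpreted as GMT.
      format.setTimeZone(TimeZone.getTimeZone("GMT"));
      ParsePosition pos = new ParsePosition(0);
      Date parsed = format.parse(date, pos);
      // Accept only if the whole input was consumed by this pattern.
      if (parsed != null && pos.getIndex() == date.length()) {
        return parsed.getTime();
      }
    }
    return -1L;
  }

  public static void main(String[] args) {
    // Stand-in for the contents of a date-styles file; these patterns are made up.
    String config = "# HTTP-style dates\n\nEEE, dd MMM yyyy HH:mm:ss zzz\n  yyyy-MM-dd  \n";
    List<String> patterns = load(new StringReader(config));
    System.out.println(patterns);
    System.out.println(parse("Mon, 31 Oct 2011 08:00:00 GMT", patterns));
    System.out.println(parse("not a date", patterns));
  }
}
```

Building a fresh list instead of removing entries while iterating avoids the `i--` bookkeeping in the quoted snippet. In a Hadoop-based setup the `Reader` could presumably come from `Configuration.getConfResourceAsReader("date-styles.txt")`, but that wiring is an assumption and is not shown in the issue.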
[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1190:
-----------------------------------
    Fix Version/s: 1.8

MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
---------------------------------------------------------------------------------------------

                Key: NUTCH-1190
                URL: https://issues.apache.org/jira/browse/NUTCH-1190
            Project: Nutch
         Issue Type: Improvement
         Components: indexer
   Affects Versions: 1.4
        Environment: jdk6
           Reporter: Zhang JinYan
            Fix For: 2.3, 1.8
        Attachments: date-styles.txt, MoreIndexingFilter.patch, NUTCH-1190-trunk.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1190:
----------------------------------------
    Fix Version/s: 2.2
                   1.7

MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
---------------------------------------------------------------------------------------------

                Key: NUTCH-1190
                URL: https://issues.apache.org/jira/browse/NUTCH-1190
            Project: Nutch
         Issue Type: Improvement
         Components: indexer
   Affects Versions: 1.4
        Environment: jdk6
           Reporter: Zhang JinYan
            Fix For: 1.7, 2.2
        Attachments: date-styles.txt, MoreIndexingFilter.patch
[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhang JinYan updated NUTCH-1190:
--------------------------------
    Attachment: date-styles.txt
                MoreIndexingFilter.patch

MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
---------------------------------------------------------------------------------------------

                Key: NUTCH-1190
                URL: https://issues.apache.org/jira/browse/NUTCH-1190
            Project: Nutch
         Issue Type: Improvement
         Components: indexer
   Affects Versions: 1.4
        Environment: jdk6
           Reporter: Zhang JinYan
        Attachments: MoreIndexingFilter.patch, date-styles.txt
[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhang JinYan updated NUTCH-1190:
--------------------------------
    Description:
There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
The date formats can be diverse, so why not move them to an extra config file?
I moved all the date formats from "MoreIndexingFilter.java" to a file named
"date-styles.txt" (placed in "conf"), which is loaded on startup.
{code}
public void setConf(Configuration conf) {
  this.conf = conf;
  MIME = new MimeUtil(conf);

  URL res = conf.getResource("date-styles.txt");
  if(res==null){
    LOG.error("Can't find resource: date-styles.txt");
  }else{
    try {
      List lines = FileUtils.readLines(new File(res.getFile()));
      for (int i = 0; i < lines.size(); i++) {
        String dateStyle = (String) lines.get(i);
        if(StringUtils.isBlank(dateStyle)){
          lines.remove(i);
          i--;
          continue;
        }
        dateStyle=StringUtils.trim(dateStyle);
        if(dateStyle.startsWith("#")){
          lines.remove(i);
          i--;
          continue;
        }
        lines.set(i, dateStyle);
      }
      dateStyles = new String[lines.size()];
      lines.toArray(dateStyles);
    } catch (IOException e) {
      LOG.error("Failed to load resource: date-styles.txt");
    }
  }
}
{code}
Then parse "lastModified" like this (sample):
{code}
private long getTime(String date, String url) {
  ..
  Date parsedDate = DateUtils.parseDate(date, dateStyles);
  time = parsedDate.getTime();
  ..
  return time;
}
{code}
This patch also includes the patch for
[NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
Find more details in the patch file.
  was:
There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
The date formats can be diverse, so why not move them to an extra config file?
I moved all the date formats from "MoreIndexingFilter.java" to a file named
"date-styles.txt", which is loaded on startup.
{code}
public void setConf(Configuration conf) {
  this.conf = conf;
  MIME = new MimeUtil(conf);

  URL res = conf.getResource("date-styles.txt");
  if(res==null){
    LOG.error("Can't find resource: date-styles.txt");
  }else{
    try {
      List lines = FileUtils.readLines(new File(res.getFile()));
      for (int i = 0; i < lines.size(); i++) {
        String dateStyle = (String) lines.get(i);
        if(StringUtils.isBlank(dateStyle)){
          lines.remove(i);
          i--;
          continue;
        }
        dateStyle=StringUtils.trim(dateStyle);
        if(dateStyle.startsWith("#")){
          lines.remove(i);
          i--;
          continue;
        }
        lines.set(i, dateStyle);
      }
      dateStyles = new String[lines.size()];
      lines.toArray(dateStyles);
    } catch (IOException e) {
      LOG.error("Failed to load resource: date-styles.txt");
    }
  }
}
{code}
Then parse "lastModified" like this (sample):
{code}
private long getTime(String date, String url) {
  ..
  Date parsedDate = DateUtils.parseDate(date, dateStyles);
  time = parsedDate.getTime();
  ..
  return time;
}
{code}
This patch also includes the patch for
[NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
Find more details in the patch file.

MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
---------------------------------------------------------------------------------------------

                Key: NUTCH-1190
                URL: https://issues.apache.org/jira/browse/NUTCH-1190
            Project: Nutch
         Issue Type: Improvement
         Components: indexer
   Affects Versions: 1.4
        Environment: jdk6
           Reporter: Zhang JinYan
        Attachments: MoreIndexingFilter.patch, date-styles.txt

There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
The date formats can be diverse, so why not move them to an extra config file?
I moved all the date formats from "MoreIndexingFilter.java" to a file named
"date-styles.txt" (placed in "conf"), which is loaded on startup.
{code}
public void setConf(Configuration conf) {
  this.conf = conf;
  MIME = new MimeUtil(conf);

  URL res = conf.getResource("date-styles.txt");