[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.

2020-08-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1190:
---
  Component/s: plugin
Fix Version/s: 1.18

> MoreIndexingFilter refactor: move data formats used to parse "lastModified" 
> to a config file.
> -
>
>                 Key: NUTCH-1190
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1190
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.4
>         Environment: jdk6
>            Reporter: Zhang JinYan
>            Priority: Major
>             Fix For: 1.18
>
> Attachments: MoreIndexingFilter.patch, NUTCH-1190-trunk.patch, 
> date-styles.txt
>
>
> There are many issues about missing date formats:
> [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
> [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
> [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
> The date formats can be diverse, so why not move them to a separate
> config file?
> I moved all the date formats from "MoreIndexingFilter.java" to a file named
> "date-styles.txt" (placed in "conf"), which is loaded on startup.
> {code}
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>     MIME = new MimeUtil(conf);
>
>     // Load the date patterns from conf/date-styles.txt.
>     URL res = conf.getResource("date-styles.txt");
>     if (res == null) {
>       LOG.error("Can't find resource: date-styles.txt");
>     } else {
>       try {
>         List lines = FileUtils.readLines(new File(res.getFile()));
>         for (int i = 0; i < lines.size(); i++) {
>           String dateStyle = (String) lines.get(i);
>           // Drop blank lines.
>           if (StringUtils.isBlank(dateStyle)) {
>             lines.remove(i);
>             i--;
>             continue;
>           }
>           dateStyle = StringUtils.trim(dateStyle);
>           // Drop comment lines.
>           if (dateStyle.startsWith("#")) {
>             lines.remove(i);
>             i--;
>             continue;
>           }
>           lines.set(i, dateStyle);
>         }
>         dateStyles = new String[lines.size()];
>         lines.toArray(dateStyles);
>       } catch (IOException e) {
>         LOG.error("Failed to load resource: date-styles.txt", e);
>       }
>     }
>   }
> {code}
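
Editorial sketch, not part of the attached patch: the loading step above can also be written against a Reader instead of a File. That matters when date-styles.txt is served from inside a jar, where new File(res.getFile()) does not resolve to a real file. The class name DateStyleLoader and the sample patterns are hypothetical; in Nutch the Reader could presumably come from conf.getConfResourceAsReader("date-styles.txt"), which is an assumption about how the plugin would be wired.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the NUTCH-1190 patch.
final class DateStyleLoader {

  /** Reads one date pattern per line, skipping blank lines and '#' comments. */
  static String[] load(Reader source) throws IOException {
    List<String> styles = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(source);
    String line;
    while ((line = reader.readLine()) != null) {
      String style = line.trim();
      if (style.isEmpty() || style.startsWith("#")) {
        continue; // blank line or comment, not a pattern
      }
      styles.add(style);
    }
    return styles.toArray(new String[0]);
  }

  public static void main(String[] args) throws IOException {
    // Made-up file content; the real date-styles.txt attachment may differ.
    String sample = "# RFC 1123\nEEE, dd MMM yyyy HH:mm:ss zzz\n\n  yyyy-MM-dd  \n";
    for (String style : load(new StringReader(sample))) {
      System.out.println(style);
    }
  }
}
```

Collecting into a new list, rather than removing entries from the list being iterated, avoids the index bookkeeping (i--) and yields the String[] that DateUtils.parseDate expects.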
> Then parse "lastModified" like this (sample):
> {code}
>   private long getTime(String date, String url) {
>     ..
>     Date parsedDate = DateUtils.parseDate(date, dateStyles);
>     time = parsedDate.getTime();
>     ..
>     return time;
>   }
> {code}
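
Editorial sketch: as a rough, standard-library-only illustration of what parsing against a list of patterns amounts to (the patch itself calls Commons Lang DateUtils.parseDate), each pattern is tried in order and the first one that consumes the whole input wins. MultiPatternDateParser is a hypothetical name, and fixing Locale.ENGLISH and GMT is an assumption of this sketch, not behaviour taken from the patch.

```java
import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not taken from the NUTCH-1190 patch.
final class MultiPatternDateParser {

  /** Tries each pattern in order; returns epoch millis of the first full match. */
  static long parse(String date, String[] patterns) throws ParseException {
    for (String pattern : patterns) {
      SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
      // Assumption: treat inputs without an explicit zone as GMT.
      format.setTimeZone(TimeZone.getTimeZone("GMT"));
      ParsePosition pos = new ParsePosition(0);
      Date parsed = format.parse(date, pos);
      // Accept only if the pattern matched the entire string.
      if (parsed != null && pos.getIndex() == date.length()) {
        return parsed.getTime();
      }
    }
    throw new ParseException("Unable to parse date: " + date, 0);
  }

  public static void main(String[] args) throws ParseException {
    String[] patterns = {"EEE, dd MMM yyyy HH:mm:ss zzz", "yyyy-MM-dd"};
    System.out.println(parse("Thu, 01 Jan 1970 00:00:01 GMT", patterns)); // 1000
    System.out.println(parse("1970-01-02", patterns)); // 86400000
  }
}
```

Because the patterns are tried in order, the more specific formats are best listed first in the pattern file.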
> This patch also contains the patch for
> [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
> Find more details in the patch file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.

2013-05-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1190:
---

Fix Version/s: 1.8

 MoreIndexingFilter refactor: move data formats used to parse "lastModified"
 to a config file.
 -

 Key: NUTCH-1190
 URL: https://issues.apache.org/jira/browse/NUTCH-1190
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
 Environment: jdk6
Reporter: Zhang JinYan
 Fix For: 2.3, 1.8

 Attachments: date-styles.txt, MoreIndexingFilter.patch, 
 NUTCH-1190-trunk.patch



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1190:


Fix Version/s: 2.2
   1.7

 MoreIndexingFilter refactor: move data formats used to parse "lastModified"
 to a config file.
 -

 Key: NUTCH-1190
 URL: https://issues.apache.org/jira/browse/NUTCH-1190
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
 Environment: jdk6
Reporter: Zhang JinYan
 Fix For: 1.7, 2.2

 Attachments: date-styles.txt, MoreIndexingFilter.patch





[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.

2011-11-01 Thread Zhang JinYan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang JinYan updated NUTCH-1190:


Attachment: date-styles.txt
MoreIndexingFilter.patch

 MoreIndexingFilter refactor: move data formats used to parse "lastModified"
 to a config file.
 -

 Key: NUTCH-1190
 URL: https://issues.apache.org/jira/browse/NUTCH-1190
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
 Environment: jdk6
Reporter: Zhang JinYan
 Attachments: MoreIndexingFilter.patch, date-styles.txt







[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.

2011-11-01 Thread Zhang JinYan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang JinYan updated NUTCH-1190:


Description: 
There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]

The date formats can be diverse, so why not move them to a separate
config file?
I moved all the date formats from "MoreIndexingFilter.java" to a file named
"date-styles.txt" (placed in "conf"), which is loaded on startup.
{code}
  public void setConf(Configuration conf) {
    this.conf = conf;
    MIME = new MimeUtil(conf);

    // Load the date patterns from conf/date-styles.txt.
    URL res = conf.getResource("date-styles.txt");
    if (res == null) {
      LOG.error("Can't find resource: date-styles.txt");
    } else {
      try {
        List lines = FileUtils.readLines(new File(res.getFile()));
        for (int i = 0; i < lines.size(); i++) {
          String dateStyle = (String) lines.get(i);
          // Drop blank lines.
          if (StringUtils.isBlank(dateStyle)) {
            lines.remove(i);
            i--;
            continue;
          }
          dateStyle = StringUtils.trim(dateStyle);
          // Drop comment lines.
          if (dateStyle.startsWith("#")) {
            lines.remove(i);
            i--;
            continue;
          }
          lines.set(i, dateStyle);
        }
        dateStyles = new String[lines.size()];
        lines.toArray(dateStyles);
      } catch (IOException e) {
        LOG.error("Failed to load resource: date-styles.txt", e);
      }
    }
  }
{code}
Then parse "lastModified" like this (sample):
{code}
  private long getTime(String date, String url) {
    ..
    Date parsedDate = DateUtils.parseDate(date, dateStyles);
    time = parsedDate.getTime();
    ..
    return time;
  }
{code}
This patch also contains the patch for
[NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
Find more details in the patch file.


  was:
There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]

The date formats can be diverse, so why not move them to a separate
config file?
I moved all the date formats from "MoreIndexingFilter.java" to a file named
"date-styles.txt", which is loaded on startup.
{code}
  public void setConf(Configuration conf) {
    this.conf = conf;
    MIME = new MimeUtil(conf);

    // Load the date patterns from date-styles.txt.
    URL res = conf.getResource("date-styles.txt");
    if (res == null) {
      LOG.error("Can't find resource: date-styles.txt");
    } else {
      try {
        List lines = FileUtils.readLines(new File(res.getFile()));
        for (int i = 0; i < lines.size(); i++) {
          String dateStyle = (String) lines.get(i);
          // Drop blank lines.
          if (StringUtils.isBlank(dateStyle)) {
            lines.remove(i);
            i--;
            continue;
          }
          dateStyle = StringUtils.trim(dateStyle);
          // Drop comment lines.
          if (dateStyle.startsWith("#")) {
            lines.remove(i);
            i--;
            continue;
          }
          lines.set(i, dateStyle);
        }
        dateStyles = new String[lines.size()];
        lines.toArray(dateStyles);
      } catch (IOException e) {
        LOG.error("Failed to load resource: date-styles.txt", e);
      }
    }
  }
{code}
Then parse "lastModified" like this (sample):
{code}
  private long getTime(String date, String url) {
    ..
    Date parsedDate = DateUtils.parseDate(date, dateStyles);
    time = parsedDate.getTime();
    ..
    return time;
  }
{code}
This patch also contains the patch for
[NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
Find more details in the patch file.



 MoreIndexingFilter refactor: move data formats used to parse "lastModified"
 to a config file.
 -

 Key: NUTCH-1190
 URL: https://issues.apache.org/jira/browse/NUTCH-1190
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
 Environment: jdk6
Reporter: Zhang JinYan
 Attachments: MoreIndexingFilter.patch, date-styles.txt


 There are many issues about missing date formats:
 [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
 [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
 [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
 The date formats can be diverse, so why not move them to a separate
 config file?
 I moved all the date formats from "MoreIndexingFilter.java" to a file named
 "date-styles.txt" (placed in "conf"), which is loaded on startup.
 {code}
   public void setConf(Configuration conf) {
     this.conf = conf;
     MIME = new MimeUtil(conf);

     URL res = conf.getResource("date-styles.txt");