[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-12 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512201
 ] 

Doğacan Güney commented on NUTCH-505:
-

Andrzej, in my tests, java.util.regex is faster on both Java 1.5 and Java 1.6.

And btw, I added ( and ) as valid path characters to the relevant regex pattern 
because Nutch was able to fetch a URL containing them.
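
To illustrate the kind of change described above, here is a minimal sketch of 
a precompiled java.util.regex pattern for the path part of a URL that accepts 
( and ). The class name PathPatternSketch and the exact character class are 
assumptions made for illustration; they are not copied from the NUTCH-505 
patch.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the pattern used in the NUTCH-505 patch.
final class PathPatternSketch {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    // The character class is an assumed, simplified set of path characters
    // that includes '(' and ')'.
    private static final Pattern PATH =
        Pattern.compile("^(/[-\\w:@&?=+,.!/~*'%$()]*)?$");

    // Returns true when the whole path consists of allowed characters.
    static boolean isValidPath(String path) {
        return path != null && PATH.matcher(path).matches();
    }
}
```

With this sketch a path such as /wiki/Foo_(bar) is accepted, while a path 
containing a space or HTML markup such as /<div> is rejected.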

> Outlink urls should be validated
> 
>
> Key: NUTCH-505
> URL: https://issues.apache.org/jira/browse/NUTCH-505
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Doğacan Güney
>Assignee: Doğacan Güney
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: filtered.txt, NUTCH-505-v2.patch, NUTCH-505-v3.patch, 
> NUTCH-505.patch, NUTCH-505.patch, NUTCH-505_draft.patch, 
> NUTCH-505_draft_v2.patch
>
>
> See discussion here:
> http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
> Parse plugins may extract garbage urls from pages. We need a url validation 
> system that tests these urls and filters out garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-12 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512139
 ] 

Andrzej Bialecki  commented on NUTCH-505:
-

Please test Java 1.5 and Java 1.6 - IIRC there are some differences in 
performance of java.util.regex between these two versions.



[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-12 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512074
 ] 

Doğacan Güney commented on NUTCH-505:
-

Thanks for the suggestion. Automaton looks really good, but using it in 
UrlValidator would mean bringing the automaton jar into Nutch core (it currently 
resides in the lib directory of the urlfilter-automaton plugin). I am not sure 
if that's OK with everyone.



[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-12 Thread Espen Amble Kolstad (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512071
 ] 

Espen Amble Kolstad commented on NUTCH-505:
---

Automaton (http://www.brics.dk/automaton/), used in AutomatonURLFilter, is even 
faster if you preparse the regexes.
It doesn't support all regex syntax, but it supports most of it.



[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511985
 ] 

Hudson commented on NUTCH-505:
--

Integrated in Nutch-Nightly #147 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/147/])



[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-10 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511447
 ] 

Andrzej Bialecki  commented on NUTCH-505:
-

* In ParseOutputFormat, the calculation of outlinksToStore should not make 
repeated calls to job.getInt(). The value of db.max.outlinks.per.page should 
be retrieved once per invocation of getRecordWriter().

* You should increase the version number of ParseData and add code to read 
the existing (pre-patch) version of ParseData. Otherwise the updated code won't 
be able to read older segments.

Other than that, the patch looks great. +1 for committing it after fixing these 
issues.
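
The two review points above can be sketched roughly as follows. This is only 
an illustration under stated assumptions: OutlinkLimitSketch and 
VersionedRecordSketch are hypothetical stand-ins, java.util.Properties 
replaces the Hadoop job configuration, and the field layout is invented. None 
of it is the actual ParseOutputFormat or ParseData code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

// Point 1 (hypothetical stand-in): look the limit up once, then reuse it.
final class OutlinkLimitSketch {
    private final int maxOutlinks;

    OutlinkLimitSketch(Properties conf) {
        // Read once at construction instead of once per record.
        // The default of "100" is an assumption for this sketch.
        maxOutlinks = Integer.parseInt(
            conf.getProperty("db.max.outlinks.per.page", "100"));
    }

    int outlinksToStore(int found) {
        // Assumption: a negative limit means "no limit".
        return maxOutlinks < 0 ? found : Math.min(maxOutlinks, found);
    }
}

// Point 2 (hypothetical stand-in): a record that still reads its old layout.
final class VersionedRecordSketch {
    static final byte OLD_VERSION = 1; // assumed old layout: title only
    static final byte CUR_VERSION = 2; // assumed new layout: title + outlink count

    String title = "";
    int outlinkCount = 0;

    void write(DataOutput out) throws IOException {
        out.writeByte(CUR_VERSION); // always write the newest layout
        out.writeUTF(title);
        out.writeInt(outlinkCount);
    }

    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version != OLD_VERSION && version != CUR_VERSION) {
            throw new IOException("Unsupported version: " + version);
        }
        title = in.readUTF();
        // Older data has no count field, so fall back to a default.
        outlinkCount = (version >= CUR_VERSION) ? in.readInt() : 0;
    }

    // Convenience helpers for trying the sketch without a real segment.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            write(new DataOutputStream(buf));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static VersionedRecordSketch fromBytes(byte[] bytes) {
        try {
            VersionedRecordSketch r = new VersionedRecordSketch();
            r.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return r;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The design point in the second class is that the writer always emits the 
newest version byte, while the reader branches on the version it finds, so 
data written before the change stays readable.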



Re: [jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-06-26 Thread Kai_testing Middleton
I can confirm that with NUTCH-505_draft_v2.patch I no longer get outlink urls 
that contain html mark-up as I was getting before on www.variety.com.

--Kai Middleton


[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-06-25 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507803
 ] 

Doğacan Güney commented on NUTCH-505:
-

btw, for http://www.variety.com/, these are the 'urls' filtered:

http:/
http://www.variety.com/
http://www.variety.com/
mailto:[EMAIL PROTECTED]
http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '?
http://ad.doubleclick.net/ad/variety.dart/;sz=993x47;ord=' + randomnumber + '?

Since we will not distribute score to these, this patch may also slightly 
improve scoring.
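
As a rough sketch of how garbage outlinks like some of those listed above 
could be dropped, the example below uses java.net.URI and keeps only http and 
https URLs that parse cleanly and have a host. The class OutlinkFilterSketch 
and its rules are assumptions for illustration, not the validation rules in 
the NUTCH-505 patch. In particular, this sketch accepts 
http://www.variety.com/, and the thread does not say why that URL appears in 
the filtered list.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the checks are assumptions, not the patch's rules.
final class OutlinkFilterSketch {
    static boolean looksValid(String url) {
        if (url == null) {
            return false;
        }
        try {
            // URI parsing rejects spaces and other illegal characters.
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            boolean http = "http".equalsIgnoreCase(scheme)
                || "https".equalsIgnoreCase(scheme);
            // "http:/" parses but has no host, so it is rejected here.
            return http && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Returns the outlinks that pass looksValid, preserving their order.
    static List<String> keepValid(List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (looksValid(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Under these assumed rules, http:/ and the two ad.doubleclick.net strings with 
embedded JavaScript fragments are rejected, because the first has no host and 
the others contain spaces.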