Re: ERROR: Too Many Fetch Failures

2009-11-19 Thread Eric Osgood

Julien,

Another thought - I just installed Tomcat and Solr - would that interfere with Hadoop?

On Nov 19, 2009, at 2:41 PM, Eric Osgood wrote:


Julien,

Thanks for your help. How would I go about fixing this error now that it is diagnosed?


On Nov 19, 2009, at 1:50 PM, Julien Nioche wrote:

could be a communication problem between the node and the master. It is not a fetching problem in the Nutch sense of the term but a Hadoop-related issue.

2009/11/19 Eric Osgood 

This is the first time I have received this error while crawling. During a crawl of 100K pages, one of the nodes had a task fail and cited "Too Many Fetch Failures" as the reason. The job completed successfully but took about 3 times longer than normal. Here is the log output:


2009-11-19 11:19:56,377 WARN  mapred.TaskTracker - Error running child
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:197)
    at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1575)
    at java.io.FilterInputStream.close(FilterInputStream.java:155)
    at org.apache.hadoop.util.LineReader.close(LineReader.java:91)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:169)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:346)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)
2009-11-19 11:19:56,380 WARN  mapred.TaskRunner - Parent died.  Exiting attempt_200911191100_0001_m_29_1
2009-11-19 11:20:21,135 WARN  mapred.TaskRunner - Parent died.  Exiting attempt_200911191100_0001_r_04_1

Can anyone tell me how to resolve this error?

Thanks,


Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com





--
DigitalPebble Ltd
http://www.digitalpebble.com





Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: ERROR: Too Many Fetch Failures

2009-11-19 Thread Eric Osgood

Julien,

Thanks for your help. How would I go about fixing this error now that it is diagnosed?


On Nov 19, 2009, at 1:50 PM, Julien Nioche wrote:

could be a communication problem between the node and the master. It is not a fetching problem in the Nutch sense of the term but a Hadoop-related issue.

2009/11/19 Eric Osgood 

This is the first time I have received this error while crawling. During a crawl of 100K pages, one of the nodes had a task fail and cited "Too Many Fetch Failures" as the reason. The job completed successfully but took about 3 times longer than normal. Here is the log output:


2009-11-19 11:19:56,377 WARN  mapred.TaskTracker - Error running child
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:197)
    at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1575)
    at java.io.FilterInputStream.close(FilterInputStream.java:155)
    at org.apache.hadoop.util.LineReader.close(LineReader.java:91)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:169)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:346)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)
2009-11-19 11:19:56,380 WARN  mapred.TaskRunner - Parent died.  Exiting attempt_200911191100_0001_m_29_1
2009-11-19 11:20:21,135 WARN  mapred.TaskRunner - Parent died.  Exiting attempt_200911191100_0001_r_04_1

Can anyone tell me how to resolve this error?

Thanks,


Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com





--
DigitalPebble Ltd
http://www.digitalpebble.com





Re: ERROR: Too Many Fetch Failures

2009-11-19 Thread Julien Nioche
could be a communication problem between the node and the master. It is not
a fetching problem in the Nutch sense of the term but a Hadoop-related
issue.

2009/11/19 Eric Osgood 

> This is the first time I have received this error while crawling. During a
> crawl of 100K pages, one of the nodes had a task fail and cited "Too Many
> Fetch Failures" as the reason. The job completed successfully but took about
> 3 times longer than normal. Here is the log output
>
>
> 2009-11-19 11:19:56,377 WARN  mapred.TaskTracker - Error running child
> java.io.IOException: Filesystem closed
>     at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:197)
>     at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1575)
>     at java.io.FilterInputStream.close(FilterInputStream.java:155)
>     at org.apache.hadoop.util.LineReader.close(LineReader.java:91)
>     at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:169)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:346)
>     at org.apache.hadoop.mapred.Child.main(Child.java:158)
> 2009-11-19 11:19:56,380 WARN  mapred.TaskRunner - Parent died.  Exiting attempt_200911191100_0001_m_29_1
> 2009-11-19 11:20:21,135 WARN  mapred.TaskRunner - Parent died.  Exiting attempt_200911191100_0001_r_04_1
>
> Can anyone tell me how to resolve this error?
>
> Thanks,
>
>
> Eric Osgood
> -
> Cal Poly - Computer Engineering, Moon Valley Software
> -
> eosg...@calpoly.edu, e...@lakemeadonline.com
> -
> www.calpoly.edu/~eosgood ,
> www.lakemeadonline.com
>
>


-- 
DigitalPebble Ltd
http://www.digitalpebble.com
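
For context, "Too many fetch failures" is reported when reduce tasks repeatedly fail to pull map output from another TaskTracker over HTTP, which usually points to hostname resolution or firewall problems between the slave nodes rather than to Nutch. A quick, hypothetical sanity check (the hostnames below are placeholders for the entries in conf/slaves) is to confirm on each node that every peer resolves consistently and is reachable:

import java.net.InetAddress;

// Hypothetical check for "Too many fetch failures": run on each node with the
// hostnames from conf/slaves to confirm that every peer resolves and answers.
public class ClusterResolutionCheck {
    public static void main(String[] args) throws Exception {
        String[] peers = args.length > 0 ? args : new String[] {"master", "slave1", "slave2"};
        for (String host : peers) {
            InetAddress addr = InetAddress.getByName(host); // fails fast if DNS or /etc/hosts is wrong
            boolean up = addr.isReachable(3000);            // basic reachability probe, 3 s timeout
            System.out.printf("%s -> %s reachable=%b%n", host, addr.getHostAddress(), up);
        }
    }
}

If a hostname resolves to 127.0.0.1 on its own node but to a LAN address elsewhere (a common /etc/hosts mistake), reducers will report exactly this kind of fetch failure.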


ERROR: Too Many Fetch Failures

2009-11-19 Thread Eric Osgood
This is the first time I have received this error while crawling. During a crawl of 100K pages, one of the nodes had a task fail and cited "Too Many Fetch Failures" as the reason. The job completed successfully but took about 3 times longer than normal. Here is the log output:



2009-11-19 11:19:56,377 WARN  mapred.TaskTracker - Error running child
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:197)
    at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1575)
    at java.io.FilterInputStream.close(FilterInputStream.java:155)
    at org.apache.hadoop.util.LineReader.close(LineReader.java:91)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:169)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:346)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)
2009-11-19 11:19:56,380 WARN  mapred.TaskRunner - Parent died.  Exiting attempt_200911191100_0001_m_29_1
2009-11-19 11:20:21,135 WARN  mapred.TaskRunner - Parent died.  Exiting attempt_200911191100_0001_r_04_1


Can anyone tell me how to resolve this error?

Thanks,


Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: support for robot rules that include a wild card

2009-11-19 Thread Ken Krugler

Hi Jason,

I've been spending some time on an improved robots.txt parser, as part of my Bixo project.


One aspect is support for Google wildcard extensions.

I think this will be part of the proposed "crawler-commons" project where we'll put components that can/should be shared between Nutch, Bixo, Heritrix and Droids.


One thing that would be useful is to collect examples of "advanced" robots.txt files, in addition to broken ones.


It would be great if you could open a Jira issue and attach specific examples of the above that you know about.


Thanks!

-- Ken


On Nov 19, 2009, at 11:31am, J.G.Konrad wrote:


I'm using nutch-1.0 and have noticed after running some tests that the robot rules parser does not support wildcards (a.k.a. globbing) in rules. This means such a rule will not work as the person who wrote the robots.txt file expected. For example:

User-Agent: *
Disallow: /somepath/*/someotherpath

Even Yahoo has one such rule (http://m.www.yahoo.com/robots.txt):
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?

With the popularity of the wildcard (*) in robots.txt files these days, what are the plans/thoughts on adding support for it in Nutch?

Thanks,
 Jason



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
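
Matching a Google-style wildcard rule mostly amounts to treating '*' as "any run of characters" and a trailing '$' as an end-of-path anchor. The sketch below is a hypothetical illustration of that idea, not the Nutch, Bixo, or crawler-commons implementation:

import java.util.regex.Pattern;

// Hypothetical wildcard handling for robots.txt path rules:
// '*' matches any run of characters, a trailing '$' anchors the end of the path.
public class WildcardRuleSketch {

    static Pattern ruleToPattern(String rule) {
        StringBuilder regex = new StringBuilder("^");
        for (int i = 0; i < rule.length(); i++) {
            char c = rule.charAt(i);
            if (c == '*') {
                regex.append(".*");
            } else if (c == '$' && i == rule.length() - 1) {
                regex.append("$");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern disallow = ruleToPattern("/*?");                         // Yahoo's "Disallow: /*?"
        System.out.println(disallow.matcher("/search?q=nutch").find());  // true  -> blocked by this rule
        System.out.println(disallow.matcher("/index.html").find());      // false -> not blocked by this rule
    }
}

A real parser also has to pick the most specific matching rule and honor Allow lines, which is part of what the shared crawler-commons parser is meant to cover.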






support for robot rules that include a wild card

2009-11-19 Thread J.G.Konrad
I'm using nutch-1.0 and have noticed after running some tests that the robot rules parser does not support wildcards (a.k.a. globbing) in rules. This means such a rule will not work as the person who wrote the robots.txt file expected. For example:

User-Agent: *
Disallow: /somepath/*/someotherpath

Even Yahoo has one such rule (http://m.www.yahoo.com/robots.txt):
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?

With the popularity of the wildcard (*) in robots.txt files these days, what are the plans/thoughts on adding support for it in Nutch?

Thanks,
  Jason


Nutch upgrade to Hadoop

2009-11-19 Thread John Martyniak
Does anybody know of any concrete plans to update Nutch to Hadoop 0.20 or 0.21?


Something like a Nutch 1.1 release, to get in some bug fixes and get current on Hadoop?


I think that should be one of the goals.

My 2 cents.

-John



AW: AW: substitute unknown parts of the url

2009-11-19 Thread Myname To
Thank you for the regex explanation. My folder name doesn't have special characters. I will read up on URL regexes and crawling in more detail.
The first time I used nutch-1.0 I had problems with plugins, so I switched to 0.9.

regards,
mailusenet





From: Subhojit Roy
To: nutch-user@lucene.apache.org
Sent: Thursday, November 19, 2009, 16:05:55
Subject: Re: AW: substitute unknown parts of the url

Yes, [a-zA-Z]* will not match names that contain special characters such as -, !, @, etc. The other possibility is to try .*, where . represents any character (including special characters).

Interestingly, when we tried the [a-zA-Z]* pattern with Nutch 1.0, it worked for us.

-sroy

On Thu, Nov 19, 2009 at 7:58 PM, Ken Krugler wrote:

>
> On Nov 19, 2009, at 2:15am, Myname To wrote:
>
>  Ken, thank you for answering my question.
>>
>> i try [^/]+ for the unknown part of the url, but unfortunately i get the
>> log:
>> ...
>> Stopping at depth=0 - no more URLs to fetch.
>> No URLs to fetch - check your seed list and URL filters.
>> crawl finished: crawl
>>
>> i try this and other code:
>>
>> http://([a-z0-9]*\.)*website.com+[a-zA-Z]+/known-folder/
>> http://([a-z0-9]*\.)*website.com(/*)(/known-folder)
>>
>> actually i don't realy unterstand using predefined char in this case. eg.
>> which part is to parenthesize, or when i have to use asterisk *, plus + or
>>  backslash follow by point \. and so on ..
>>
>
> You'll need to understand regular expressions if you plan to modify the URL
> filter patterns.
>
>
>  if the unknown part of the path has a name, isn't better to use something
>> like [a-zA-Z] or do i have  to add other chars in [^/]+ ?
>>
>
> [^/]+ says to match one or more characters which are not equal to '/'. So
> that will match anything, versus the more explicit [a-zA-Z]+, which wouldn't
> match (for example) "some-folder".
>
> -- Ken
>
>
>
>
>> From: Ken Krugler
>> To: nutch-user@lucene.apache.org
>> Sent: Thursday, November 19, 2009, 2:06:53
>> Subject: Re: substitute unknown parts of the url
>>
>>
>> On Nov 18, 2009, at 4:53pm, Myname To wrote:
>>
>>  hello
>>>
>>> can somebody help me with urlfilter. i need to fetch sites with this
>>> pattern:
>>>
>>> http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/
>>>
>>> first folder can vary, whereas host name and second folder are known.
>>>
>>> how can i substitute unknown parts (folders) of the url?
>>>
>>
>> Something like...
>>
>> http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/
>>
>> -- Ken
>>
>> 
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>>
>>
>
> 
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>


-- 
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: s...@profound.in
http://www.profound.in



Re: AW: substitute unknown parts of the url

2009-11-19 Thread Subhojit Roy
Yes, [a-zA-Z]* will not match names that contain special characters such as -, !, @, etc. The other possibility is to try .*, where . represents any character (including special characters).

Interestingly, when we tried the [a-zA-Z]* pattern with Nutch 1.0, it worked for us.

-sroy

On Thu, Nov 19, 2009 at 7:58 PM, Ken Krugler wrote:

>
> On Nov 19, 2009, at 2:15am, Myname To wrote:
>
>  Ken, thank you for answering my question.
>>
>> i try [^/]+ for the unknown part of the url, but unfortunately i get the
>> log:
>> ...
>> Stopping at depth=0 - no more URLs to fetch.
>> No URLs to fetch - check your seed list and URL filters.
>> crawl finished: crawl
>>
>> i try this and other code:
>>
>> http://([a-z0-9]*\.)*website.com+[a-zA-Z]+/known-folder/
>> http://([a-z0-9]*\.)*website.com(/*)(/known-folder)
>>
>> actually i don't realy unterstand using predefined char in this case. eg.
>> which part is to parenthesize, or when i have to use asterisk *, plus + or
>>  backslash follow by point \. and so on ..
>>
>
> You'll need to understand regular expressions if you plan to modify the URL
> filter patterns.
>
>
>  if the unknown part of the path has a name, isn't better to use something
>> like [a-zA-Z] or do i have  to add other chars in [^/]+ ?
>>
>
> [^/]+ says to match one or more characters which are not equal to '/'. So
> that will match anything, versus the more explicit [a-zA-Z]+, which wouldn't
> match (for example) "some-folder".
>
> -- Ken
>
>
>
>
>> From: Ken Krugler
>> To: nutch-user@lucene.apache.org
>> Sent: Thursday, November 19, 2009, 2:06:53
>> Subject: Re: substitute unknown parts of the url
>>
>>
>> On Nov 18, 2009, at 4:53pm, Myname To wrote:
>>
>>  hello
>>>
>>> can somebody help me with urlfilter. i need to fetch sites with this
>>> pattern:
>>>
>>> http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/
>>>
>>> first folder can vary, whereas host name and second folder are known.
>>>
>>> how can i substitute unknown parts (folders) of the url?
>>>
>>
>> Something like...
>>
>> http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/
>>
>> -- Ken
>>
>> 
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>>
>>
>
> 
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>


-- 
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: s...@profound.in
http://www.profound.in


Re: AW: substitute unknown parts of the url

2009-11-19 Thread Ken Krugler


On Nov 19, 2009, at 2:15am, Myname To wrote:


Ken, thank you for answering my question.

I tried [^/]+ for the unknown part of the URL, but unfortunately I get this log:

...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

I tried this and other patterns:

http://([a-z0-9]*\.)*website.com+[a-zA-Z]+/known-folder/
http://([a-z0-9]*\.)*website.com(/*)(/known-folder)

Actually, I don't really understand how to use the predefined characters in this case, e.g. which part to parenthesize, or when I have to use an asterisk *, a plus +, or a backslash followed by a dot \. and so on.


You'll need to understand regular expressions if you plan to modify the URL filter patterns.


If the unknown part of the path has a name, isn't it better to use something like [a-zA-Z], or do I have to add other characters to [^/]+ ?


[^/]+ says to match one or more characters which are not equal to '/'. So that will match anything, versus the more explicit [a-zA-Z]+, which wouldn't match (for example) "some-folder".


-- Ken




From: Ken Krugler
To: nutch-user@lucene.apache.org
Sent: Thursday, November 19, 2009, 2:06:53
Subject: Re: substitute unknown parts of the url


On Nov 18, 2009, at 4:53pm, Myname To wrote:


Hello,

Can somebody help me with the URL filter? I need to fetch sites with this pattern:

http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/

The first folder can vary, whereas the host name and second folder are known.

How can I substitute the unknown parts (folders) of the URL?


Something like...

http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
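
In conf/regex-urlfilter.txt that expression would go in as an accept rule, e.g. +^http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/ (a leading '+' accepts, '-' rejects). Before re-running a crawl it can help to test the expression on a few sample URLs; the snippet below is a small, hypothetical harness using java.util.regex directly rather than Nutch's RegexURLFilter plugin, with placeholder URLs:

import java.util.regex.Pattern;

// Hypothetical stand-alone check of the urlfilter expression suggested above.
// "website.com" and the sample URLs are placeholders.
public class UrlFilterRegexCheck {
    public static void main(String[] args) {
        Pattern accept = Pattern.compile(
                "http://([a-z0-9]*\\.)*website.com/[^/]+/known-folder/");
        String[] urls = {
            "http://www.website.com/2009-archive/known-folder/",  // matches: one variable folder
            "http://website.com/known-folder/",                    // no match: middle folder missing
            "http://www.website.com/a/b/known-folder/"             // no match: [^/]+ covers exactly one folder
        };
        for (String url : urls) {
            System.out.println(accept.matcher(url).find() + "  " + url);
        }
    }
}

If this harness matches the intended URLs but the crawl still stops at depth=0, the seed URLs themselves are probably being rejected by an earlier rule in the filter file, since rules are applied in order and the first match wins.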






AW: substitute unknown parts of the url

2009-11-19 Thread Myname To
Hello,

After trying [^/] and *[a-z0-9]** for the unknown folder in the URL path, I can't find any other URL regex that fixes this problem.

I tried another way out with the regex-normalize.xml file.

But after adding these lines, Nutch still doesn't find any URL:
...
  <regex>
    <pattern>website.com/[a-zA-Z0-9]/known-folder/</pattern>
    <substitution>website.com/known-folder/</substitution>
  </regex>
...

Do I have to learn more about URL regexes with Nutch?

Thank you for your reply.

mailusenet 





From: Myname To
To: nutch-user@lucene.apache.org
Sent: Thursday, November 19, 2009, 1:53:51
Subject: substitute unknown parts of the url

Hello,

Can somebody help me with the URL filter? I need to fetch sites with this pattern:

http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/

The first folder can vary, whereas the host name and second folder are known.

How can I substitute the unknown parts (folders) of the URL?

Any help appreciated!

regards 
mailusenet
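
One thing to keep in mind with this approach: regex-normalize.xml rewrites URLs (each pattern is replaced by its substitution), while regex-urlfilter.txt only accepts or rejects them, so a normalizer rule on its own will not make a rejected URL crawlable. The effect of a pattern/substitution pair can be previewed with plain java.util.regex; the snippet below is a rough, hypothetical illustration using the values quoted above (the sample URL is a placeholder), not the actual urlnormalizer-regex plugin code:

import java.util.regex.Pattern;

// Rough preview of a regex-normalize.xml rule: the pattern is replaced by the
// substitution wherever it occurs in the URL.
public class NormalizeRulePreview {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("website.com/[a-zA-Z0-9]/known-folder/");
        String substitution = "website.com/known-folder/";
        String url = "http://www.website.com/x/known-folder/page.html";
        System.out.println(pattern.matcher(url).replaceAll(substitution));
        // Prints http://www.website.com/known-folder/page.html.
        // Note: [a-zA-Z0-9] without a quantifier matches exactly ONE character,
        // so a longer folder name such as "archive" would not be rewritten here.
    }
}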


  

AW: substitute unknown parts of the url

2009-11-19 Thread Myname To
Thank you, sroy.

As I wrote to Ken, I don't clearly understand the regex in this case.
With your regex suggestion I now get this error log:

Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)


I am using nutch-0.9 on Red Hat.

And there is no problem with a URL filter like
+^http://([a-z0-9]*\.)*website.com/known-folder/known-folder/

Any other suggestions?

regards,
mailusenet




From: Subhojit Roy
To: nutch-user@lucene.apache.org
Sent: Thursday, November 19, 2009, 10:13:12
Subject: Re: substitute unknown parts of the url

Hi,

Try the regular expression below.

+^http://([a-z0-9]*\.)*website.com/*[a-z0-9]**/known-folder/

-sroy


On Thu, Nov 19, 2009 at 6:23 AM, Myname To  wrote:

> hello
>
> can somebody help me with urlfilter. i need to fetch sites with this
> pattern:
>
> http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/
>
> first folder can vary, whereas host name and second folder are known.
>
> how can i substitute unknown parts (folders) of the url?
>
> any help appreciated!
>
> regards
> mailusenet
>
>
>




-- 
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: s...@profound.in
http://www.profound.in



AW: substitute unknown parts of the url

2009-11-19 Thread Myname To
Ken, thank you for answering my question.
 
I tried [^/]+ for the unknown part of the URL, but unfortunately I get this log:
...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

I tried this and other patterns:

http://([a-z0-9]*\.)*website.com+[a-zA-Z]+/known-folder/
http://([a-z0-9]*\.)*website.com(/*)(/known-folder)

Actually, I don't really understand how to use the predefined characters in this case, e.g. which part to parenthesize, or when I have to use an asterisk *, a plus +, or a backslash followed by a dot \. and so on.

If the unknown part of the path has a name, isn't it better to use something like [a-zA-Z], or do I have to add other characters to [^/]+ ?

regards
mailusenet





From: Ken Krugler
To: nutch-user@lucene.apache.org
Sent: Thursday, November 19, 2009, 2:06:53
Subject: Re: substitute unknown parts of the url


On Nov 18, 2009, at 4:53pm, Myname To wrote:

> hello
>
> can somebody help me with urlfilter. i need to fetch sites with this  
> pattern:
>
> http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/
>
> first folder can vary, whereas host name and second folder are known.
>
> how can i substitute unknown parts (folders) of the url?

Something like...

http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g


Re: substitute unknown parts of the url

2009-11-19 Thread Subhojit Roy
Hi,

Try the regular expression below.

+^http://([a-z0-9]*\.)*website.com/*[a-z0-9]**/known-folder/

-sroy


On Thu, Nov 19, 2009 at 6:23 AM, Myname To  wrote:

> hello
>
> can somebody help me with urlfilter. i need to fetch sites with this
> pattern:
>
> http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/
>
> first folder can vary, whereas host name and second folder are known.
>
> how can i substitute unknown parts (folders) of the url?
>
> any help appreciated!
>
> regards
> mailusenet
>
>
>




-- 
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: s...@profound.in
http://www.profound.in