date:20061020

[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-10-20 Thread Enis Soztutar (JIRA)

 [ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:


Description: 
NutchAnalysis.jj tokenizes the input by treating & and _ as non-separator 
characters, which is not appropriate for URLs. So I have written a URL 
tokenizer that emits the tokens matching the regular expression [a-zA-Z0-9]. 
As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which 
describes the grammar for URIs, URLs can be tokenized with the above expression. 

NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the "url", 
"site" and "host" fields.


see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

  was:
NutchAnalysis.jj tokenizes the input by treating & and _ as non-separator 
characters, which is not appropriate for URLs. So I have written a URL 
tokenizer that emits the tokens matching the regular expression [a-zA-Z0-9]. 
As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which 
describes the grammar for URIs, URLs can be tokenized with the above expression. 


see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html


> a url tokenizer implementation for tokenizing index fields : url and host
> -
>
> Key: NUTCH-389
> URL: http://issues.apache.org/jira/browse/NUTCH-389
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 0.9.0
>Reporter: Enis Soztutar
>Priority: Minor
> Attachments: urlTokenizer.diff
>
>
> NutchAnalysis.jj tokenizes the input by treating & and _ as non-separator 
> characters, which is not appropriate for URLs. So I have written a URL 
> tokenizer that emits the tokens matching the regular expression 
> [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
> which describes the grammar for URIs, URLs can be tokenized with the above 
> expression. 
> NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the 
> "url", "site" and "host" fields.
> see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-10-20 Thread Enis Soztutar (JIRA)

 [ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:


Attachment: urlTokenizer.diff

patch for url tokenization

> a url tokenizer implementation for tokenizing index fields : url and host
> -
>
> Key: NUTCH-389
> URL: http://issues.apache.org/jira/browse/NUTCH-389
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 0.9.0
>Reporter: Enis Soztutar
>Priority: Minor
> Attachments: urlTokenizer.diff
>
>
> NutchAnalysis.jj tokenizes the input by treating & and _ as non-separator 
> characters, which is not appropriate for URLs. So I have written a URL 
> tokenizer that emits the tokens matching the regular expression 
> [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
> which describes the grammar for URIs, URLs can be tokenized with the above 
> expression. 
> see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html


[jira] Created: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-10-20 Thread Enis Soztutar (JIRA)

a url tokenizer implementation for tokenizing index fields : url and host 
--

 Key: NUTCH-389
 URL: http://issues.apache.org/jira/browse/NUTCH-389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor


NutchAnalysis.jj tokenizes the input by treating & and _ as non-separator 
characters, which is not appropriate for URLs. So I have written a URL 
tokenizer that emits the tokens matching the regular expression [a-zA-Z0-9]. 
As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which 
describes the grammar for URIs, URLs can be tokenized with the above expression. 


see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
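
The tokenization rule described above can be sketched in plain Java. This is
only an illustration under stated assumptions, not the code in
urlTokenizer.diff: the class name UrlTokenSketch and the method tokenize are
hypothetical, it uses java.util.regex rather than Lucene's tokenizer classes,
and it assumes a token is a maximal run of characters matching [a-zA-Z0-9].
Under that rule, & and _ act as separators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of Nutch or of the attached patch.
class UrlTokenSketch {
    // Assumption: a token is a maximal run of ASCII letters and digits.
    // Every other character (including '&' and '_') separates tokens.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(url);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [http, www, example, com, a, b, x, 1, y, 2]
        System.out.println(tokenize("http://www.example.com/a_b?x=1&y=2"));
    }
}
```

With this rule a query term such as "example" can match the host part of an
indexed URL, because the dots and slashes no longer stay inside a token.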


[jira] Commented: (NUTCH-387) host normalization in Generator$Selector

2006-10-20 Thread Otis Gospodnetic (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-387?page=comments#action_12443742 ] 

Otis Gospodnetic commented on NUTCH-387:


This indeed looks wrong.
My guess is that the new URL() line just needs to be removed, but I'm not 
sure, so I'll let somebody else make the actual change.

> host normalization in Generator$Selector
> 
>
> Key: NUTCH-387
> URL: http://issues.apache.org/jira/browse/NUTCH-387
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
> Environment: nutch trunk since revision 449088
>Reporter: Johannes Zillmann
>
> The host normalization in Generator$Selector#reduce at line 177 seems broken:
>   String host = new URL(url.toString()).getHost();
>   ...
>   try {
>     host = normalizers.normalize(host,
>         URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
>     host = new URL(host).getHost().toLowerCase();
>   } catch (Exception e) {
>     LOG.warn("Malformed URL: '" + host + "', skipping");
>   }
> With the default configuration the basic normalizer is called, which does 
> 'new URL(host)'.
> The line below it also calls 'new URL(host)'.
> Since url.getHost() always returns the host without a protocol, a 
> MalformedURLException is always thrown.
> The job continues as usual, though, because the exception is caught.
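
The behaviour reported above can be reproduced outside Nutch with a minimal
sketch. The class HostUrlSketch, the helper parses and the example.com
addresses are hypothetical; only java.net.URL is real. It shows that a bare
host name, which is what getHost() returns, is rejected by the URL constructor
because it has no protocol.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration; not taken from the Nutch Generator code.
class HostUrlSketch {
    // Returns true if java.net.URL accepts the string as an absolute URL.
    static boolean parses(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // getHost() drops the protocol and the path.
        String host = new URL("http://www.example.com/index.html").getHost();
        System.out.println(host);                 // prints www.example.com
        // Feeding the bare host back into new URL(...) fails.
        System.out.println(parses(host));         // prints false
        System.out.println(parses("http://" + host + "/")); // prints true
    }
}
```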


Re: Problem parsing some MS Excel & other formats (Office 2003)

2006-10-20 Thread Aisha


Hi Andrzej,

Thank you for your reply.

As I get a lot of exceptions, could you please have a look at them and
tell me if there is a way to solve them:

  -  Error parsing: file:/C:/doc to index/conges.xls: failed(2,0): Can't be
handled as micrsosoft document.
org.apache.poi.hssf.record.RecordFormatException: Unable to construct record
instance, the following exception occured: null

  - Error parsing: file:/C:/docs_a_indexer/doc1/test.doc: failed(2,0): Can't
be handled as micrsosoft document. java.util.NoSuchElementException
 
  - Error parsing: file:/C:/docs_a_indexer/doc3/test.rtf: failed(2,0): Can't
be handled as micrsosoft document. java.io.IOException: Invalid header
signature; read 7015536635646467195, expected -2226271756974174256

  - 2006-10-13 17:29:42,343 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: dsp
at java.net.URL.<init>(URL.java:574)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:84)
at org.apache.nutch.parse.msword.MSWordParser.getParse(MSWordParser.java:43)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)


In the last error, the string after "unknown protocol: " is not always dsp;
it seems to be different in each case, and I don't understand what this
string means.
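
For reference, the "unknown protocol" text is produced by java.net.URL itself:
it reports the part of the candidate link before the first colon when no
handler is registered for it. A minimal sketch, assuming a made-up input
"dsp:something" (the actual text found in the documents is not shown in the
log) and a hypothetical helper named failureMessage:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration; the input string is invented for the example.
class UnknownProtocolSketch {
    // Returns the exception message if the string is rejected, else null.
    static String failureMessage(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // prints unknown protocol: dsp
        System.out.println(failureMessage("dsp:something"));
    }
}
```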

Thank you very much.

Best regards,
Aïcha 

Aisha wrote:
> Hi,
>
> I tried the latest releases, nutch-2006-10-13.tar.gz and
> nutch-2006-10-19.tar.gz,
> but the NPE doesn't seem to be fixed. I always get the same exception
> message for a lot of documents in a lot of formats: Excel, but Word and
> PowerPoint too:
>
> 2006-10-19 16:41:09,265 WARN  parse.ParseUtil - Unable to successfully
> parse
> content file://C:/docs_a_indexer/test.doc of type application/msword
> 2006-10-19 16:41:09,265 WARN  fetcher.Fetcher - Error parsing:
> file:/C:/docs_a_indexer/test.doc: failed(2,0): Can't be handled as
> Microsoft
> document. org.apache.nutch.parse.msword.FastSavedException: Fast-saved
> files
> are unsupported at this time
>
> Could you please help me, because the volume of rejected documents is
> large...
>   

This failure means that you can't parse these files using the 
lib-parsems plugins, because they were saved in the "fast save" format, 
which is not supported.

Your only option is to use some other external parser through the 
parse-ext plugin.

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





-- 
View this message in context: 
http://www.nabble.com/Problem-parsing-some-MS-Excel---other-formats-%28Office-2003%29-tf2408217.html#a6911914
Sent from the Nutch - Dev mailing list archive at Nabble.com.
