[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:

    Description:
NutchAnalysis.jj tokenizes the input by not treating & and _ as token separators, which is not appropriate for URLs. So I have written a URL tokenizer which emits the tokens that match the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression.
NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the "url", "site" and "host" fields.
see: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

  was:
NutchAnalysis.jj tokenizes the input by not treating & and _ as token separators, which is not appropriate for URLs. So I have written a URL tokenizer which emits the tokens that match the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression.
see: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

> a url tokenizer implementation for tokenizing index fields : url and host
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-389
>                 URL: http://issues.apache.org/jira/browse/NUTCH-389
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Priority: Minor
>         Attachments: urlTokenizer.diff

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:

    Attachment: urlTokenizer.diff

patch for url tokenization
[jira] Created: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
a url tokenizer implementation for tokenizing index fields : url and host
-------------------------------------------------------------------------

                 Key: NUTCH-389
                 URL: http://issues.apache.org/jira/browse/NUTCH-389
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
    Affects Versions: 0.9.0
            Reporter: Enis Soztutar
            Priority: Minor

NutchAnalysis.jj tokenizes the input by not treating & and _ as token separators, which is not appropriate for URLs. So I have written a URL tokenizer which emits the tokens that match the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression.
see: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
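The tokenization rule described in this issue can be illustrated with a small standalone sketch. This is not the code from urlTokenizer.diff (the patch itself is not reproduced in these messages) and it does not use the Lucene tokenizer API. The class name UrlTokenizerSketch, the method tokenize and the sample URL are hypothetical, and the sketch assumes that a token is a maximal run of characters matching [a-zA-Z0-9].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlTokenizerSketch {

    // Assumption: a token is a maximal run of the characters named in the issue.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    /** Splits a URL or host string into alphanumeric tokens, in order of appearance. */
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(url);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // '&' and '_' split tokens here, unlike in the behaviour the issue complains about.
        System.out.println(tokenize("http://www.example.com/foo_bar?a=1&b=2"));
    }
}
```

With this rule, the sample URL is split at every character outside the class, so it prints [http, www, example, com, foo, bar, a, 1, b, 2].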
[jira] Commented: (NUTCH-387) host normalization in Generator$Selector
[ http://issues.apache.org/jira/browse/NUTCH-387?page=comments#action_12443742 ]

Otis Gospodnetic commented on NUTCH-387:

This indeed looks wrong. My guess is that the new URL() line just needs to be removed, but I'm not sure, so I'll let somebody else make the actual change.

> host normalization in Generator$Selector
> ----------------------------------------
>
>                 Key: NUTCH-387
>                 URL: http://issues.apache.org/jira/browse/NUTCH-387
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>         Environment: nutch trunk since revision 449088
>            Reporter: Johannes Zillmann
>
> The host normalization in Generator$Selector#reduce at line 177 seems broken:
>   String host = new URL(url.toString()).getHost();
>   ...
>   try {
>     host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
>     host = new URL(host).getHost().toLowerCase();
>   } catch (Exception e) {
>     LOG.warn("Malformed URL: '" + host + "', skipping");
>   }
> With the default configuration the basic normalizer is called, which does 'new URL(host)'.
> The line below it also calls 'new URL(host)'.
> Since url.getHost() always returns the host without the protocol, a MalformedURLException is always thrown.
> The job continues as usual, though, because the exception is caught.
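The failure mode described in the report can be reproduced outside Nutch with java.net.URL alone. The following is a minimal sketch, not the Generator code: the class name HostOnlyUrlSketch, the helper hostOf and the sample URL are hypothetical, and the Nutch URLNormalizers are not involved. It only shows that a bare host name, as returned by getHost(), cannot be parsed as a URL a second time.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

public class HostOnlyUrlSketch {

    /**
     * Returns the lower-cased host of an absolute URL, or null when the
     * string cannot be parsed as a URL (for example, when it has no protocol).
     * Note: the URL(String) constructor may be flagged as deprecated on newer JDKs.
     */
    static String hostOf(String s) {
        try {
            return new URL(s).getHost().toLowerCase(Locale.ROOT);
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // First pass: a full URL yields its host.
        String host = hostOf("http://WWW.Example.com/index.html");
        System.out.println(host);

        // Second pass: the host alone has no protocol, so parsing fails.
        System.out.println(hostOf(host));
    }
}
```

The first call yields www.example.com and the second yields null, which corresponds to the exception the report says is thrown on every record and then swallowed by the catch block.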
Re: Problem parsing some MS Excel & other formats (Office 2003)
Hi Andrzej,

Thank you for your reply.

Since a lot of exceptions are raised, could you please have a look at them and tell me whether there is a way to solve them:

- Error parsing: file:/C:/doc to index/conges.xls: failed(2,0): Can't be handled as micrsosoft document. org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance, the following exception occured: null

- Error parsing: file:/C:/docs_a_indexer/doc1/test.doc: failed(2,0): Can't be handled as micrsosoft document. java.util.NoSuchElementException

- Error parsing: file:/C:/docs_a_indexer/doc3/test.rtf: failed(2,0): Can't be handled as micrsosoft document. java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256

- 2006-10-13 17:29:42,343 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: dsp
        at java.net.URL.<init>(URL.java:574)
        at java.net.URL.<init>(URL.java:464)
        at java.net.URL.<init>(URL.java:413)
        at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
        at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
        at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
        at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:84)
        at org.apache.nutch.parse.msword.MSWordParser.getParse(MSWordParser.java:43)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)

In the last error, the string after "unknown protocol: " is not always dsp; it seems to be different in each case, and I don't understand what this string means.

Thank you very much.

Best regards,
Aïcha

Aisha wrote:
> Hi,
>
> I tried the latest releases, nutch-2006-10-13.tar.gz and
> nutch-2006-10-19.tar.gz, but the NPE doesn't seem to be fixed. I always get
> the same exception message for a lot of documents and a lot of formats:
> Excel, but Word and PowerPoint too:
>
> 2006-10-19 16:41:09,265 WARN parse.ParseUtil - Unable to successfully parse content file://C:/docs_a_indexer/test.doc of type application/msword
> 2006-10-19 16:41:09,265 WARN fetcher.Fetcher - Error parsing: file:/C:/docs_a_indexer/test.doc: failed(2,0): Can't be handled as Microsoft document. org.apache.nutch.parse.msword.FastSavedException: Fast-saved files are unsupported at this time
>
> Could you please help me, because the volume of rejected documents is
> large...
>

The reason for failure means that you can't parse these files using the lib-parsems plugins, because they use a "fast save" format, which is not supported. Your only option is to use some other external parser through the parse-ext plugin.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

--
View this message in context: http://www.nabble.com/Problem-parsing-some-MS-Excel---other-formats-%28Office-2003%29-tf2408217.html#a6911914
Sent from the Nutch - Dev mailing list archive at Nabble.com.
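On the question about the string after "unknown protocol: ": the stack trace above shows the exception coming from java.net.URL, reached through the outlink extractor and the URL normalizer. The following minimal sketch, assuming nothing about Nutch itself, shows how java.net.URL produces that message. The class name UnknownProtocolSketch, the helper failureReason and the sample strings are hypothetical, and it assumes no stream handler is registered for "dsp" in the running JVM.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UnknownProtocolSketch {

    /**
     * Returns the MalformedURLException message for a candidate link,
     * or null when java.net.URL accepts it.
     * Note: the URL(String) constructor may be flagged as deprecated on newer JDKs.
     */
    static String failureReason(String candidate) {
        try {
            new URL(candidate);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // "http" has a built-in handler, so this parses.
        System.out.println(failureReason("http://example.com/"));

        // "dsp" has no handler; the text before the first ':' is reported as the protocol.
        System.out.println(failureReason("dsp:something"));
    }
}
```

The reported string is simply the part of the candidate link before its first colon, so it would be expected to change with whatever colon-separated text the outlink extractor picked up from each document.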