[jira] Commented: (NUTCH-811) Develop an ORM framework
[ https://issues.apache.org/jira/browse/NUTCH-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865226#action_12865226 ]

Enis Soztutar commented on NUTCH-811:

Hi Piet,

The code for Gora will reside on GitHub for now, since Nutch and Gora are fairly orthogonal. But as stated before, Nutch is the first user of Gora, and Gora does not yet have a separate community, so I intend to keep the Nutch community updated (via this issue and the nutch-dev mailing list), and I hope for feedback from the Nutch community. Moreover, NutchBase has already been ported to use Gora, so at some point Gora should be reviewed and accepted as a dependency of Nutch.

Develop an ORM framework

Key: NUTCH-811
URL: https://issues.apache.org/jira/browse/NUTCH-811
Project: Nutch
Issue Type: New Feature
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 2.0

From NUTCH-808 it is clear that we need an ORM layer on top of the datastore, so that different backends can be used to store data. This issue will track the development of the ORM layer. Initially, full support for HBase is planned, with RDBMS, Hadoop MapFile, and Cassandra support scheduled for later.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar closed NUTCH-808.

Resolution: Fixed

We have decided to go ahead with implementing an ORM layer, as per the discussion on NUTCH-811. Closing this issue.

Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

Key: NUTCH-808
URL: https://issues.apache.org/jira/browse/NUTCH-808
Project: Nutch
Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 2.0

We have an ORM layer in the NutchBase branch, which uses the Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating whether other frameworks suit our needs. We want at least the following capabilities:
- Using POJOs
- Able to persist objects to at least HBase, Cassandra, and RDBMSs
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries

Any comments or suggestions for other frameworks are welcome.
[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856360#action_12856360 ]

Enis Soztutar commented on NUTCH-808:

bq. What do you mean by current implementation? NutchBase?

Indeed. The package o.a.n.storage deals with the ORM (though not all classes in it).

bq. I know that Cascading has various Tap/Sink implementations including JDBC, HBase but also SimpleDB. Maybe it would be worth having a look at how they do it?

The way Cascading does this is to convert Tuples (Cascading's data structure) to HBase/JDBC records. The schema for HBase/JDBC is given as metadata. Since they deal only with the tuple - table row mapping, it is not that difficult. But again, Cascading does not allow mapping lists to columns, etc.

bq. My gut feeling would be to write a custom framework instead of relying on DataNucleus and use AVRO if possible. I really think that HBase support is urgently needed but am less convinced that we need MySQL in the very short term.

Yeah, the more I think about it, the more I come to terms with a custom implementation. However, I think we might benefit a lot from the ideas in JDO in the long term. Also, a JDBC implementation may not be relevant for large-scale deployments, but it will be a very nice side effect of the ORM layer: it will allow easy deployment, which in turn will hopefully bring more users.
[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856124#action_12856124 ]

Enis Soztutar commented on NUTCH-808:

So, these are the results so far. DataNucleus was previously known as JPOX, and it was the reference implementation for Java Data Objects (JDO). JDO is a Java standard for persistence. A similar specification, JPA, is also a persistence standard, forked from EJB 3. However, JPA is designed for RDBMSs only, so it will not be useful for us (http://www.datanucleus.org/products/accessplatform/persistence_api.html).

In JDO, the first step is to define the domain objects as POJOs. Then the persistence metadata is specified using annotations, XML, or both. A bytecode enhancer then uses instrumentation to add the required methods to the classes marked @PersistenceCapable. The database tables can be generated by hand, automatically by DataNucleus, or by using a tool (SchemaTool). The persistence layer uses standard JDO syntax, which is similar to JDBC. The objects can be queried using JPQL.

I have run a small test to persist objects of the WebTableRow class (from the NutchBase branch) to both MySQL and HBase. Although it took me a fair bit of time to set up, I was able to persist objects to both. However, although it is possible to map complex fields (like lists, maps, arrays, etc.) to RDBMSs using different strategies (such as serializing directly, using joins, or using foreign keys), I was not able to find a way to leverage the HBase data model. For example, we want to be able to map lists and maps to columns in column families. Without such functionality, using column-oriented stores does not bring any advantage.

For the byte[] serialization for MapReduce, we can either implement a new datastore for DataNucleus which also implements Hadoop's Serialization, or use Avro to generate Java classes to be fed into the JPOX enhancer, or else manually implement Writable.
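The kind of mapping described above — turning a POJO's Map field into individual columns under an HBase column family — can be sketched as a toy, with hypothetical field and family names (this is neither the DataNucleus nor the HBase API, just an illustration of the desired layout):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration: each entry of a Map-typed field becomes its own
// qualifier under a column family, e.g. a metadata field {"lang": "en"}
// is stored as the cell "mtdt:lang" = "en".
public class ColumnMapper {

    // Flatten a Map field into "family:qualifier" -> value cells.
    public static Map<String, String> toColumns(String family, Map<String, String> field) {
        Map<String, String> cells = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : field.entrySet()) {
            cells.put(family + ":" + e.getKey(), e.getValue());
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<String, String>();
        metadata.put("lang", "en");
        metadata.put("charset", "UTF-8");
        System.out.println(toColumns("mtdt", metadata));
    }
}
```

This is the mapping a framework would need to generate from persistence metadata; the JDO strategies mentioned (serialize directly, joins, foreign keys) flatten such fields away instead, which is why they cannot exploit the column-oriented model.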
To sum up, DataNucleus brings the following advantages:
- out-of-the-box RDBMS support
- XML or annotation metadata
- JDO is a Java standard
- standard query interface
- JSON support

The disadvantages of using DataNucleus would be:
- JDO is rather complex; implementing a datastore is not trivial
- we would need to write patches to DataNucleus to flexibly map complex fields and leverage HBase's data model
- we have no control over the source code
- no native HBase support (for example, using filters, etc.)

On the other hand, the current implementation:
- is tested in production
- can leverage the HBase data model
- can be modified to work with Avro serialization directly
- could gain Cassandra support with little effort
- can support multiple languages (in the future)

I believe that having SQLite, MySQL, and HBase support is critical for Nutch 2.0, for out-of-the-box use, ease of deployment, and real-scale computing respectively. But obviously we cannot use DataNucleus out of the box either. ORM is inherently a hard problem. I propose we go ahead and make the changes to DataNucleus to see if that is feasible, and continue with it if it suits our needs. Of course, having a custom framework would also be great, so any feedback would be more than welcome.
[jira] Created: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

Key: NUTCH-808
URL: https://issues.apache.org/jira/browse/NUTCH-808
Project: Nutch
Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar

We have an ORM layer in the NutchBase branch, which uses the Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating whether other frameworks suit our needs. We want at least the following capabilities:
- Using POJOs
- Able to persist objects to at least HBase, Cassandra, and RDBMSs
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries

Any comments or suggestions for other frameworks are welcome.
[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852840#action_12852840 ]

Enis Soztutar commented on NUTCH-808:

A candidate framework is DataNucleus. It has the following benefits:
- Apache 2 license
- JDO support
- HBase, RDBMS, and XML persistence

I will further investigate whether we can integrate Hadoop Writables/Avro serialization so that objects can be passed from MapReduce.
[jira] Commented: (NUTCH-442) Integrate Solr/Nutch
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12637489 ]

Enis Soztutar commented on NUTCH-442:

I personally believe this patch should be in before 1.0, since it does not make sense to make such a change in 1.1. However, since there is some need to test this patch more thoroughly, I guess we can make a branch and commit it there, so that people can test it easily. But branching has its own problems; in particular, keeping in sync with trunk would get harder and harder. Since this issue has a large number of votes and watchers, I suggest we branch and commit it, test it out a little bit more, and merge it into trunk before 1.0.

Integrate Solr/Nutch

Key: NUTCH-442
URL: https://issues.apache.org/jira/browse/NUTCH-442
Project: Nutch
Issue Type: New Feature
Environment: Ubuntu Linux
Reporter: rubdabadub
Attachments: Crawl.patch, Indexer.patch, NUTCH-442_v4.patch, NUTCH-442_v5.patch, NUTCH-442_v6.patch.txt, NUTCH-442_v7.patch.txt, NUTCH-442_v7a.patch.txt, NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, schema.xml

Hi: After trying out Sami's patch regarding Solr/Nutch (it can be found at http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html), I can confirm it worked :-) And that leads me to request the following: I would be very grateful if this could be included in Nutch 0.9, as I am trying to eliminate my Python-based crawler which posts documents to Solr. As I am in a corporate environment I can't install the trunk version in production, thus I am asking for this to be included in the 0.9 release. I hope my wish will be granted. I look forward to getting some feedback. Thank you.
[jira] Resolved: (NUTCH-588) Help Need
[ https://issues.apache.org/jira/browse/NUTCH-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar resolved NUTCH-588.

Resolution: Invalid

Jira is not for asking questions. You should ask your questions on the nutch-user mailing list. See http://lucene.apache.org/nutch/mailing_lists.html

Closing this issue as invalid.

Help Need

Key: NUTCH-588
URL: https://issues.apache.org/jira/browse/NUTCH-588
Project: Nutch
Issue Type: Task
Components: indexer
Affects Versions: 0.7.2
Environment: Linux
Reporter: Teccon Ingenieros

Hello, we are trying to index a Word file. If we use a static URL like /servlet/jsp/documento.doc it works, but if we try the same with a dynamic URL that generates that file (/servlet/jsp/leerFichero.jsp?id=112) it doesn't work; it doesn't index our URL. What can we do? Regards,
[jira] Commented: (NUTCH-586) Add option to run compiled classes w/o job file
[ https://issues.apache.org/jira/browse/NUTCH-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548198 ]

Enis Soztutar commented on NUTCH-586:

Can someone review this?

Add option to run compiled classes w/o job file

Key: NUTCH-586
URL: https://issues.apache.org/jira/browse/NUTCH-586
Project: Nutch
Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 1.0.0
Attachments: run-core_v1.patch

bin/nutch adds the nutch-*.job files under the build and base directories to the classpath. However, building the job file takes a long time. We have a target, compile-core, which builds only the core classes without plugins, but we need a way to run the compiled core class files. An option to bin/nutch that runs the classes compiled with "ant compile-core" seems enough.
[jira] Updated: (NUTCH-586) Add option to run compiled classes w/o job file
[ https://issues.apache.org/jira/browse/NUTCH-586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-586:

Attachment: run-core_v2.patch

bq. I think you also need to put a comment, which clarifies that this works only in the local Hadoop mode.

Agreed. This patch addresses that.
[jira] Updated: (NUTCH-586) Add option to run compiled classes w/o job file
[ https://issues.apache.org/jira/browse/NUTCH-586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-586:

Attachment: run-core_v1.patch

The attached file adds a -core option to bin/nutch.
[jira] Created: (NUTCH-586) Add option to run compiled classes w/o job file
Add option to run compiled classes w/o job file

Key: NUTCH-586
URL: https://issues.apache.org/jira/browse/NUTCH-586
Project: Nutch
Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 1.0.0
Attachments: run-core_v1.patch

bin/nutch adds the nutch-*.job files under the build and base directories to the classpath. However, building the job file takes a long time. We have a target, compile-core, which builds only the core classes without plugins, but we need a way to run the compiled core class files. An option to bin/nutch that runs the classes compiled with "ant compile-core" seems enough.
[jira] Created: (NUTCH-583) FeedParser empty links for items
FeedParser empty links for items

Key: NUTCH-583
URL: https://issues.apache.org/jira/browse/NUTCH-583
Project: Nutch
Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 1.0.0

The FeedParser in the feed plugin simply discards an item if it does not have a link element. However, RSS 2.0 does not require a link element for each item. Moreover, sometimes the link is given in the guid element, which is a globally unique identifier for the item. I think we can search for the URL of an item first; if it is still not found, we can use the feed's URL, merging all the parse texts into one Parse object.
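The proposed fallback order could look roughly like this (a self-contained sketch with hypothetical names, not the actual FeedParser API):

```java
// Sketch of the URL fallback proposed above: use the item's <link> if
// present, else a <guid> that looks like a URL (RSS 2.0 guids are often
// permalinks), else fall back to the feed's own URL.
public class ItemUrlResolver {

    public static String resolve(String link, String guid, String feedUrl) {
        if (link != null && link.length() > 0) {
            return link;
        }
        // Accept the guid only when it actually parses as an http(s) URL.
        if (guid != null && (guid.startsWith("http://") || guid.startsWith("https://"))) {
            return guid;
        }
        return feedUrl; // last resort: index the item under the feed itself
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "http://example.com/post/1", "http://example.com/feed"));
    }
}
```

In the last case, where several items share the feed's URL, their parse texts would be merged into one Parse object as described.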
[jira] Commented: (NUTCH-573) Multiple Domains - Query Search
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543067 ]

Enis Soztutar commented on NUTCH-573:

So, how shall we proceed with this one? I give +1 to committing this and dealing with NUTCH-479 in its own issue. Having both multi-term queries and OR syntax won't be too bad, I guess.

Multiple Domains - Query Search

Key: NUTCH-573
URL: https://issues.apache.org/jira/browse/NUTCH-573
Project: Nutch
Issue Type: Improvement
Components: searcher
Affects Versions: 0.9.0
Environment: All
Reporter: Rajasekar Karthik
Assignee: Enis Soztutar
Fix For: 1.0.0
Attachments: multiTermQuery_v1.patch

Searching multiple domains can be done in Lucene, but not that efficiently in Nutch. The query +content:abc +(site:www.aaa.com site:www.bbb.com) works in Lucene, but the same concept does not work in Nutch. In Lucene, it works with org.apache.lucene.analysis.KeywordAnalyzer and org.apache.lucene.analysis.standard.StandardAnalyzer, but NOT with org.apache.lucene.analysis.SimpleAnalyzer. Is the Nutch analyzer based on SimpleAnalyzer? In that case, is there a workaround to make this work? Is there an option to change what analyzer Nutch is using? Just FYI, another solution (inefficient, I believe) which seems to be working in Nutch: query -site:ccc.com -site:ddd.com
[jira] Commented: (NUTCH-573) Multiple Domains - Query Search
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542389 ]

Enis Soztutar commented on NUTCH-573:

bq. Using commas is IMHO not intuitive

With all due respect, I disagree. We cannot expect search users to type queries of the form +(site:www.somesite.com site:www.foo.com). Last time I checked, Google used the comma syntax. I think supporting only a subset of the query syntax that Lucene supports was the initial intention behind implementing another query parser for Nutch, so that ordinary search users would not get confused and could use the de facto syntax.

bq. Also, I'm not sure if the original reporter asked for a generic solution that would work with every field - if the issue at hand is just the site: field, then we can use a raw field and a RawQueryFilter to parse multiple terms within the SiteQueryFilter implementation, without changing the global query syntax.

The original intention was to allow this only in site queries; however, I cannot see a reason not to enable it for other fields.
[jira] Commented: (NUTCH-573) Multiple Domains - Query Search
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542449 ]

Enis Soztutar commented on NUTCH-573:

@Andrzej I recall Google supporting the comma-delimited syntax, but it doesn't work now, does it? Maybe I remembered wrong. http://www.google.com/intl/en/help/operators.html confirms that the comma-delimited syntax is not allowed, but we can make allintitle: ... type queries. I think the raw fields, which are site, date, type, and lang, are unlikely to contain commas, so we may not have to worry about escape characters. As far as I know, we treat a comma as whitespace, so searching for comma-containing phrases in raw fields is not enabled anyway. Of course, we can fix this should it be needed.

@Dogacan I share the same concerns about the performance and complexity of NUTCH-479. However, it may be good if it were implemented correctly.
[jira] Assigned: (NUTCH-573) Multiple Domains - Query Search
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar reassigned NUTCH-573:

Assignee: Enis Soztutar
[jira] Updated: (NUTCH-573) Multiple Domains - Query Search
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-573:

Attachment: multiTermQuery_v1.patch

Here is a patch that enables querying multiple values for the same field.
# The query syntax is changed to enable [field:]term1(,term2)* type queries, where multiple terms are converted to a boolean OR query.
# Query.Clause, Query.Term, and Query.Phrase are changed significantly.

This is an initial version of the patch for review; I will test it a little more today.
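The expansion in point 1 can be illustrated with a small standalone sketch (this is not the actual Nutch query parser or its Query classes, just the term expansion it performs):

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the syntax change: a clause like
// "site:www.aaa.com,www.bbb.com" is split on commas into multiple terms
// that are then combined as a boolean OR query.
public class MultiTermClause {

    public static List<String> expand(String clause) {
        String field = "";
        String terms = clause;
        int colon = clause.indexOf(':');
        if (colon >= 0) {
            field = clause.substring(0, colon + 1); // keep the "site:" prefix
            terms = clause.substring(colon + 1);
        }
        List<String> expanded = new ArrayList<String>();
        for (String term : terms.split(",")) {
            expanded.add(field + term);
        }
        return expanded; // each element becomes one OR'ed clause
    }

    public static void main(String[] args) {
        System.out.println(expand("site:www.aaa.com,www.bbb.com"));
    }
}
```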
[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.
[ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541326 ]

Enis Soztutar commented on NUTCH-574:

Why don't you just refactor the anchor-indexing code into another plugin, say index-anchor, enabled by default? Then all you need to do is not use that plugin but only index-basic, right? That way we can avoid adding to the never-ending list of configuration parameters *smile*.

bq. The current idea is to have three options. An always include, never include, and include if matches text on page.

In another issue, we can add a new plugin called index-anchor-matching that does its thing. Choosing from a list of plugins is the beauty of the plugin system, after all.

Including inlink anchor text in index can create irrelevant search results.

Key: NUTCH-574
URL: https://issues.apache.org/jira/browse/NUTCH-574
Project: Nutch
Issue Type: Bug
Components: indexer
Environment: All, basic indexing filter
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.0.0
Attachments: NUTCH-574-1.patch

Currently the basic indexing filter includes inbound anchor text for a given URL in the index. This sometimes allows pages to show up in search results where they may not be relevant. An example of this is a search for "dallas hotels" in our production index (www.visvo.com). Google would show up first in this example, although there is no text matching either "dallas" or "hotels" on the Google home page. What is happening here is that there are inlinks into Google with the words "dallas" and "hotels", which get included in the index for google.com, and because Google has a very high boost due to inlinks, it shows up first for these search terms. I propose we add an option to allow/prevent inlink anchor text from being included in the index, and set the default for this option to NOT include inbound link anchor text.
[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.
[ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541359 ]

Enis Soztutar commented on NUTCH-574:

Honestly, I don't think that not indexing anchor words that do not appear in the web site's text is a wise solution. What made Google so successful is indexing anchor text + PR, the classic example being that the page http://www.honda.com/ never mentions that Honda is a car manufacturer, but the anchor text does. That said, I think we should focus on finding a way to eliminate the noise in anchor text. At this point we take the first 10K links and discard the others, due to size constraints. A better way would be to select the best ones, or to select the most frequent words, etc.
[jira] Commented: (NUTCH-442) Integrate Solr/Nutch
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537954 ] Enis Soztutar commented on NUTCH-442: - Due to the method signature bug (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6267833) for {{ExecutorService#invokeAll}} the patch will not compile against 1.5. We should manage the lists as {{List<Callable<T>>}}. Integrate Solr/Nutch Key: NUTCH-442 URL: https://issues.apache.org/jira/browse/NUTCH-442 Project: Nutch Issue Type: New Feature Environment: Ubuntu linux Reporter: rubdabadub Attachments: NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, schema.xml Hi: After trying out Sami's patch regarding Solr/Nutch (can be found here: http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html), I can confirm it worked :-) And that lead me to request the following: I would be very, very grateful if this could be included in nutch 0.9, as I am trying to eliminate my python based crawler which posts documents to solr. As I am in a corporate environment I can't install the trunk version in the production environment, thus I am asking for this to be included in the 0.9 release. I hope my wish will be granted. I look forward to getting some feedback. Thank you. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
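The {{invokeAll}} workaround above can be sketched as follows (a minimal illustration, not the patch itself; the task class is hypothetical). Java 5 declared {{invokeAll(Collection<Callable<T>>)}} instead of {{Collection<? extends Callable<T>>}}, so passing a list of a concrete Callable subtype fails to compile on 1.5; declaring the list as {{List<Callable<T>>}} avoids this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InvokeAllCompat {
    // Hypothetical task, standing in for Nutch's parallel search calls.
    static class SquareTask implements Callable<Integer> {
        private final int n;
        SquareTask(int n) { this.n = n; }
        public Integer call() { return n * n; }
    }

    public static List<Integer> squares(List<Integer> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Declare as List<Callable<Integer>>, not List<SquareTask>:
            // the Java 5 invokeAll signature takes Collection<Callable<T>>,
            // so a list of the concrete subtype does not compile on 1.5.
            List<Callable<Integer>> tasks = new ArrayList<Callable<Integer>>();
            for (int n : inputs) {
                tasks.add(new SquareTask(n));
            }
            List<Integer> results = new ArrayList<Integer>();
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```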
[jira] Commented: (NUTCH-442) Integrate Solr/Nutch
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534869 ] Enis Soztutar commented on NUTCH-442: - Using nutch with solr has been a frequently requested feature, so it will be very useful when this makes it into trunk. I have spent some time reviewing the patch, which I find quite elegant. Some improvements to the patch would be:
- make NutchDocument implement VersionedWritable instead of Writable, and delegate version checking to the superclass
- refactor the getDetails() methods in HitDetailer to Searcher (it is not likely that a class would implement Searcher but not HitDetailer) - use Searcher, delete HitDetailer and SearchBean
- rename XXXBean classes so that they do not include Bean (I think it is confusing to have bean objects that have non-trivial functionality)
- refactor LuceneSearchBean.VERSION to RPCSearchBean
- remove unrelated changes from the patch (the changes in NGramProfile, HTMLLanguageParser, LanguageIdentifier, ... correct me if I'm wrong)
As far as I can see, we do not need any metadata for the Solr backend, and only need the Store, Index and Vector options for the lucene backend, so I think we can simplify NutchDocument#metadata. We may implement:
{code}
class FieldMeta {
  o.a.l.document.Field.Store store;
  o.a.l.document.Field.Index index;
  o.a.l.document.Field.TermVector tv;
}

FieldMeta[] IndexingFilter.getFields();

class NutchDocument {
  ...
  private ArrayList<FieldMeta> fieldMeta;
  ...
}
{code}
Or alternatively we may wish to keep the add methods of NutchDocument compatible with o.a.l.document.Document, keeping the metadata up-to-date as we add new fields, using this info in LuceneWriter but ignoring it in SolrWriter. This will be slightly slower but the API will be much more intuitive. 
Integrate Solr/Nutch Key: NUTCH-442 URL: https://issues.apache.org/jira/browse/NUTCH-442 Project: Nutch Issue Type: New Feature Environment: Ubuntu linux Reporter: rubdabadub Attachments: NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, schema.xml Hi: After trying out Sami's patch regarding Solr/Nutch (can be found here: http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html), I can confirm it worked :-) And that lead me to request the following: I would be very, very grateful if this could be included in nutch 0.9, as I am trying to eliminate my python based crawler which posts documents to solr. As I am in a corporate environment I can't install the trunk version in the production environment, thus I am asking for this to be included in the 0.9 release. I hope my wish will be granted. I look forward to getting some feedback. Thank you. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521033 ] Enis Soztutar commented on NUTCH-439: - Recently Matt Cutts has written about the parts of urls: http://www.mattcutts.com/blog/seo-glossary-url-definitions/ It seems that, as expected, Google deals with different parts of urls. *smile* Top Level Domains Indexing / Scoring Key: NUTCH-439 URL: https://issues.apache.org/jira/browse/NUTCH-439 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Assignee: Enis Soztutar Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, tld_plugin_v2.3.patch Top Level Domains (tlds) are the last part(s) of the host name in a DNS system. TLDs are managed by the Internet Assigned Numbers Authority. IANA divides tlds into three. infrastructure, generic(such as com, edu) and country code tlds(such as en, de , tr, ). Indexing the top level domain and optionally boosting is needed for improving the search results and enhancing locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-541) Index url field untokenized
Index url field untokenized --- Key: NUTCH-541 URL: https://issues.apache.org/jira/browse/NUTCH-541 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 1.0.0 Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 1.0.0 The url field is indexed as Store.YES, Index.TOKENIZED. We also need an untokenized version of the url field in some contexts: 1. For deleting duplicates by url (at search time), see NUTCH-455. 2. For restricting the search to a certain url (may be used in the case of RSS search, where each entry in the RSS feed is added as a distinct document with (possibly) the same url). query-url extends FieldQueryFilter, so: Query: url:http://www.apache.org/ Parsed: url:http http-www http-www-apache www www-apache apache org Translated: +url:http-http-www http-www-http-www-apache http-www-apache-www www-www-apache www-apache apache org 3. For accessing document(s) in the search servers (using a query plugin). I suggest we add the untokenized field in index-basic and implement a query-url-untoken plugin. doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.TOKENIZED)); doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, Field.Index.UN_TOKENIZED)); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-439: Attachment: tld_plugin_v2.3.patch bq. TLDScoringFilter contains a misspelled field, tldEnties, it should be renamed to tldEntries Done! bq. one of the use cases for the tld index field that you mention is that users may search on it. But in the latest patch this field is added with Field.Index.NO, which makes searching on it impossible. Also, in order to search on arbitrary Lucene fields Nutch needs a Query filter, so we would need a TLDQueryFilter, which doesn't exist (yet?). Well, in fact NUTCH-445 covers searching on tlds; namely, we would be able to search site:lucene.apache.org, site:apache.org, or even site:org, therefore I think indexing tld fields and a TLDQueryFilter is not needed. I will delve deeper into NUTCH-445 as soon as I find some time. We can move the domain indexing functionality to index-basic so that it will be generic enough. bq. using domain names instead of host names - we need to discuss this further, let's create a separate issue on this. We can open issues case by case, since the patches are expected to have major side effects. Top Level Domains Indexing / Scoring Key: NUTCH-439 URL: https://issues.apache.org/jira/browse/NUTCH-439 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Assignee: Enis Soztutar Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, tld_plugin_v2.3.patch Top Level Domains (tlds) are the last part(s) of the host name in a DNS system. TLDs are managed by the Internet Assigned Numbers Authority. IANA divides tlds into three. infrastructure, generic(such as com, edu) and country code tlds(such as en, de , tr, ). Indexing the top level domain and optionally boosting is needed for improving the search results and enhancing locality. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513819 ] Enis Soztutar commented on NUTCH-518: - Since there is no ordering among scoring filters, if we do something specific to zero in OpicScoring, it might lead to nondeterministic behaviour. Let's say, for example, the code in OpicScoring is changed so that:
{code}
public float indexerScore(Text url, Document doc, CrawlDatum dbDatum,
    CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) {
  if (initScore != 0)
    return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
  else
    // do something nasty
}
{code}
Then there will be a big difference if scoring-opic is run before or after scoring-foo. As far as I can tell from the messages in the mailing lists, scoring filters are used for restricting the crawl to topics, so zero-handling might break topic-specific crawls. So my vote is to keep the current implementation. Fix OpicScoringFilter to respect scoring filter chaining Key: NUTCH-518 URL: https://issues.apache.org/jira/browse/NUTCH-518 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Reporter: Enis Soztutar Assignee: Doğacan Güney Fix For: 1.0.0 Attachments: opicScoring.chain.patch Opic Scoring returns the score that it calculates, rather than returning previous_score * calculated_score. This prevents using another scoring filter along with Opic scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513826 ] Enis Soztutar commented on NUTCH-518: - I think removing the initial score arguments and merging scores in ScoringFilters.java is a good idea overall. +1 for this one. The final score should be calculated centrally. Maybe we can implement more than one way to calculate the score. Roughly:
ScoringFilters.getMultipliedScore()
ScoringFilters.getSummedScore()
ScoringFilters.getGeometricMeanScore()
Fix OpicScoringFilter to respect scoring filter chaining Key: NUTCH-518 URL: https://issues.apache.org/jira/browse/NUTCH-518 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Reporter: Enis Soztutar Assignee: Doğacan Güney Fix For: 1.0.0 Attachments: opicScoring.chain.patch Opic Scoring returns the score that it calculates, rather than returning previous_score * calculated_score. This prevents using another scoring filter along with Opic scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
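The centralized combination methods suggested above could look roughly like this (a sketch under the assumption that each filter returns its own factor; names and signatures are illustrative, not the actual Nutch API):

```java
import java.util.List;

public class ScoreCombiner {
    // Product of all filter factors (the current multiplicative chaining).
    public static float multiplied(List<Float> factors) {
        float score = 1.0f;
        for (float f : factors) score *= f;
        return score;
    }

    // Sum of all filter factors.
    public static float summed(List<Float> factors) {
        float score = 0.0f;
        for (float f : factors) score += f;
        return score;
    }

    // nth root of the product; assumes non-negative factors.
    public static float geometricMean(List<Float> factors) {
        double product = 1.0;
        for (float f : factors) product *= f;
        return (float) Math.pow(product, 1.0 / factors.size());
    }
}
```

One advantage of this design is that a filter returning zero no longer needs special handling inside any single filter: the combiner decides how a zero factor affects the final score.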
[jira] Created: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
Fix OpicScoringFilter to respect scoring filter chaining Key: NUTCH-518 URL: https://issues.apache.org/jira/browse/NUTCH-518 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Reporter: Enis Soztutar Fix For: 1.0.0 Opic Scoring returns the score that it calculates, rather than returning previous_score * calculated_score. This prevents using another scoring filter along with Opic scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-517) build encoding should be UTF-8
[ https://issues.apache.org/jira/browse/NUTCH-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-517: Attachment: build.encoding.patch Patch for UTF-8 is attached build encoding should be UTF-8 -- Key: NUTCH-517 URL: https://issues.apache.org/jira/browse/NUTCH-517 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Enis Soztutar Fix For: 1.0.0 Attachments: build.encoding.patch The build encoding sent to javac should be UTF-8 so that non-ascii characters can be used in the source code. This issue has emerged from NUTCH-439 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-439: Attachment: tld_plugin_v2.2.patch This patch includes core domain utilities and the tld plugin, but excludes the changes in NUTCH-517 and NUTCH-518. Top Level Domains Indexing / Scoring Key: NUTCH-439 URL: https://issues.apache.org/jira/browse/NUTCH-439 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch Top Level Domains (tlds) are the last part(s) of the host name in a DNS system. TLDs are managed by the Internet Assigned Numbers Authority. IANA divides tlds into three. infrastructure, generic(such as com, edu) and country code tlds(such as en, de , tr, ). Indexing the top level domain and optionally boosting is needed for improving the search results and enhancing locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-439: Attachment: tld_plugin_v2.0.patch I have made major improvements to the code and configuration files. Mainly, the issue is not only a plugin, but a package, one big xml file, and an indexing/scoring plugin (which is disabled by default). The list of recognized suffixes is now not limited to top level domains; second or third level public domain names can be recognized. The patch also changes the naming from top level domains to domain suffixes. This patch also introduces the URLUtil class, which includes methods for getting the domain name, or public domain suffix, of an url. Finding the domain name of a url is quite important for several reasons. First, we can use this function as a replacement of URL.getHost() in LinkDB for ignoring internal links, or in similar contexts. Second, we can perform statistical analysis on domain names. Third, we can list subdomains under a domain, etc. I have changed the build.encoding to UTF-8 so that non-ascii characters are recognized. Here is an excerpt from the domain-suffixes.xml file: This document contains top level domains as described by the Internet Assigned Numbers Authority (IANA), and second or third level domains that are known to be managed by domain registrars. People in the Mozilla community call these public suffixes or effective tlds. There is no algorithmic way of knowing whether a suffix is a public domain suffix or not, so this large file is used for this purpose. The entries in the file are used to find the domain of a url, which may not be the same thing as the host of the url. For example, for http://lucene.apache.org/nutch the hostname is lucene.apache.org, however the domain name for this url would be apache.org. Domain names can be quite handy for statistical analysis, and fighting against spam. 
The list of TLDs is constructed from IANA, and the list of effective tlds is constructed from Wikipedia, http://wiki.mozilla.org/TLD_List, and http://publicsuffix.org/ The list may not include all the suffixes, but some effort has been spent to make it comprehensive. Please forward any improvements for this list to the nutch-dev mailing list, or nutch JIRA. Top Level Domains Indexing / Scoring Key: NUTCH-439 URL: https://issues.apache.org/jira/browse/NUTCH-439 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch Top Level Domains (tlds) are the last part(s) of the host name in a DNS system. TLDs are managed by the Internet Assigned Numbers Authority. IANA divides tlds into three. infrastructure, generic(such as com, edu) and country code tlds(such as en, de , tr, ). Indexing the top level domain and optionally boosting is needed for improving the search results and enhancing locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
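The domain-lookup idea described above (the real URLUtil reads the full domain-suffixes.xml; the class name, tiny hardcoded suffix set, and lookup strategy here are simplifying assumptions for illustration) can be sketched as: walk the host's labels from the left until the part to the right of the current label is a known public suffix, and return that label plus the suffix.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DomainName {
    // Tiny illustrative suffix set; the real list has thousands of entries.
    private static final Set<String> PUBLIC_SUFFIXES = new HashSet<String>(
        Arrays.asList("org", "com", "co.uk", "gov.tr"));

    public static String getDomainName(String url) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
        String[] labels = host.split("\\.");
        // Find the shortest candidate whose remainder is a public suffix.
        for (int i = 0; i < labels.length; i++) {
            String candidate = join(labels, i);
            String suffix = join(labels, i + 1);
            if (!PUBLIC_SUFFIXES.contains(candidate)
                    && PUBLIC_SUFFIXES.contains(suffix)) {
                return candidate;  // e.g. "apache.org" for lucene.apache.org
            }
        }
        return host;  // no known suffix: fall back to the full host
    }

    private static String join(String[] labels, int from) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < labels.length; i++) {
            if (sb.length() > 0) sb.append('.');
            sb.append(labels[i]);
        }
        return sb.toString();
    }
}
```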
[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-439: Attachment: domain.suffixes_v2.1.patch Very nice patch! Thanks! IP_PATTERN - it could be tighter, instead of \\d+ it could use \\d{1,3} now it is (\\d{1,3}\\.){3}(\\d{1,3}) the DomainStatistics tool: I'd rather see it as a separate JIRA issue. The reason is that it's a common request for enhancement, but specific requirements vary wildly. Some users prefer to build a separate DB that holds statistical info and can be used in various steps of the work cycle, others still prefer one-time tools such as this one. DomainStatistics is really a quick hack I've written to demonstrate the new patch. I've removed it from the latest patch. Once the user requirements are settled, we can move on from there. Also you may not want to commit MozillaPublicSuffixListParser.java, but it is good we have it somewhere public. Top Level Domains Indexing / Scoring Key: NUTCH-439 URL: https://issues.apache.org/jira/browse/NUTCH-439 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: domain.suffixes_v2.1.patch, tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch Top Level Domains (tlds) are the last part(s) of the host name in a DNS system. TLDs are managed by the Internet Assigned Numbers Authority. IANA divides tlds into three. infrastructure, generic(such as com, edu) and country code tlds(such as en, de , tr, ). Indexing the top level domain and optionally boosting is needed for improving the search results and enhancing locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
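The tightened pattern discussed above can be tried out directly (a small demonstration; the class wrapper is illustrative). Note that \d{1,3} per octet still admits values like 999; a stricter check would also validate each octet's numeric range.

```java
import java.util.regex.Pattern;

public class IpPattern {
    // The tightened pattern: at most 3 digits per octet instead of \d+.
    static final Pattern IP_PATTERN =
        Pattern.compile("(\\d{1,3}\\.){3}(\\d{1,3})");

    public static boolean looksLikeIp(String host) {
        // matches() anchors the pattern to the whole string
        return IP_PATTERN.matcher(host).matches();
    }
}
```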
[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-439: Attachment: (was: domain.suffixes_v2.1.patch) Top Level Domains Indexing / Scoring Key: NUTCH-439 URL: https://issues.apache.org/jira/browse/NUTCH-439 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch Top Level Domains (tlds) are the last part(s) of the host name in a DNS system. TLDs are managed by the Internet Assigned Numbers Authority. IANA divides tlds into three. infrastructure, generic(such as com, edu) and country code tlds(such as en, de , tr, ). Indexing the top level domain and optionally boosting is needed for improving the search results and enhancing locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-439: Attachment: tld_plugin_v2.1.patch Oops, it seems that i've uploaded the wrong file. This is the correct one. Top Level Domains Indexing / Scoring Key: NUTCH-439 URL: https://issues.apache.org/jira/browse/NUTCH-439 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch Top Level Domains (tlds) are the last part(s) of the host name in a DNS system. TLDs are managed by the Internet Assigned Numbers Authority. IANA divides tlds into three. infrastructure, generic(such as com, edu) and country code tlds(such as en, de , tr, ). Indexing the top level domain and optionally boosting is needed for improving the search results and enhancing locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-510) IndexMerger delete working dir
IndexMerger delete working dir -- Key: NUTCH-510 URL: https://issues.apache.org/jira/browse/NUTCH-510 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.0.0 Reporter: Enis Soztutar Fix For: 1.0.0 IndexMerger does not delete the working dir when an IOException is thrown such as No space left on device. Local temporary directories should be deleted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-510) IndexMerger delete working dir
[ https://issues.apache.org/jira/browse/NUTCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-510: Attachment: index.merger.delete.temp.dirs.patch Attached patch deletes working dirs in a finally clause and eliminates java 5.0 warnings in IndexMerger. A FileAlreadyExistsException is thrown if the output index directory already exists, similar to OutputFormatBase#checkOutputSpecs(). IndexMerger delete working dir -- Key: NUTCH-510 URL: https://issues.apache.org/jira/browse/NUTCH-510 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.0.0 Reporter: Enis Soztutar Fix For: 1.0.0 Attachments: index.merger.delete.temp.dirs.patch IndexMerger does not delete the working dir when an IOException is thrown such as No space left on device. Local temporary directories should be deleted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
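The fix described above can be sketched minimally (not the actual IndexMerger code; names are illustrative): the working directory is removed in a finally clause so cleanup happens even when the merge fails with an IOException such as "No space left on device".

```java
import java.io.File;

public class MergeWithCleanup {
    public static void merge(File workDir, Runnable doMerge) {
        try {
            doMerge.run();  // the actual merge work
        } finally {
            // runs on both success and failure, so temp dirs never leak
            deleteRecursively(workDir);
        }
    }

    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) deleteRecursively(child);
        }
        f.delete();
    }
}
```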
[jira] Issue Comment Edited: (NUTCH-510) IndexMerger delete working dir
[ https://issues.apache.org/jira/browse/NUTCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511043 ] Enis Soztutar edited comment on NUTCH-510 at 7/9/07 5:32 AM: - Attached patch deletes working dirs in a finally clause and eliminates java 5.0 warnings in IndexMerger. A FileAlreadyExistsException is thrown if the output index directory already exists, similar to OutputFormatBase#checkOutputSpecs(). was: Attached patch deletes working dirs on finally clause, eliminates java 5.0 warnings and in IndexMerger. An FileAlreadyExistsException is thrown if the output index directory already exists, which is similar to OutputFormatBase#chechOutputSpecs(); IndexMerger delete working dir -- Key: NUTCH-510 URL: https://issues.apache.org/jira/browse/NUTCH-510 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.0.0 Reporter: Enis Soztutar Fix For: 1.0.0 Attachments: index.merger.delete.temp.dirs.patch IndexMerger does not delete the working dir when an IOException is thrown such as No space left on device. Local temporary directories should be deleted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-508) ${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker
[ https://issues.apache.org/jira/browse/NUTCH-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511121 ] Enis Soztutar commented on NUTCH-508: - The TaskTracker invokes another jvm running TaskTracker$Child, but hadoop.log.dir is not passed as a parameter. The logs of the user program are handled correctly, but I suppose this means that the logging calls in TaskTracker$Child itself, such as LOG.debug("Child starting"), cannot be logged. I suppose you can ask about this on hadoop-dev to see if it is indeed a hadoop issue. ${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker -- Key: NUTCH-508 URL: https://issues.apache.org/jira/browse/NUTCH-508 Project: Nutch Issue Type: Bug Environment: Linux 2.6, Java1.6 Reporter: Emmanuel Joke Fix For: 1.0.0 As described in http://www.nabble.com/Crawl-error-with-hadoop-t3994217.html the log4j config file is missing some parameters. hadoop.log.dir=. hadoop.log.file=hadoop.log Thanks for the help of Mathijs -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-471) Fix synchronization in NutchBean creation
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-471: Attachment: NutchBeanCreationSync_v2.patch From http://www-128.ibm.com/developerworks/java/library/j-dcl.html The bottom line is that double-checked locking, in whatever form, should not be used because you cannot guarantee that it will work on any JVM implementation. JSR-133 is addressing issues regarding the memory model, however, double-checked locking will not be supported by the new memory model. Therefore, you have two options: * Accept the synchronization of a getInstance() method as shown in Listing 2. * Forgo synchronization and use a static field. Since we don't want to compromise performance in NutchBean.get(), synchronization is not a solution. Thus, as Sami suggested, I have written a ServletContextListener and added the NutchBean construction code there, and modified web.xml to register the event listener class. Also, in the servlet context initialization, the Configuration object is initialized and cached by NutchConfiguration, so we avoid the same problem in NutchConfiguration.get(). I have tested the implementation and it seems OK. Fix synchronization in NutchBean creation - Key: NUTCH-471 URL: https://issues.apache.org/jira/browse/NUTCH-471 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Reporter: Enis Soztutar Fix For: 1.0.0 Attachments: NutchBeanCreationSync_v1.patch, NutchBeanCreationSync_v2.patch NutchBean is created and then cached in servlet context. But NutchBean.get(ServletContext app, Configuration conf) is not synchronized, which causes more than one instance of the bean (and DistributedSearch$Client) if the servlet container is accessed rapidly during startup. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
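The web.xml registration mentioned above would look roughly like this (the listener class name here is an assumption for illustration, not the exact name used in the patch):

```xml
<!-- Register a startup listener that constructs the NutchBean once,
     before any request is served, so no locking is needed later.
     The class name below is illustrative. -->
<listener>
  <listener-class>org.apache.nutch.searcher.NutchBeanConstructingListener</listener-class>
</listener>
```

Because the container runs context listeners single-threaded at webapp startup, the bean is created exactly once with no synchronization on the request path.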
[jira] Commented: (NUTCH-475) Adaptive crawl delay
[ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491882 ] Enis Soztutar commented on NUTCH-475: - We can use a formula like: delay = alpha * delay + (1 - alpha) * (k * t), where 0 < alpha <= 1, so that the waiting time is less sensitive to varying reply times of the server. Adaptive crawl delay Key: NUTCH-475 URL: https://issues.apache.org/jira/browse/NUTCH-475 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Doğacan Güney Fix For: 1.0.0 Attachments: adaptive-delay_draft.patch Current fetcher implementation waits a default interval before making another request to the same server (if crawl-delay is not specified in robots.txt). IMHO, an adaptive implementation will be better. If the server is under little load and can serve requests fast, then the fetcher can ask for more pages in a given interval. Similarly, if the server is suffering from heavy load, the fetcher can slow down (w.r.t. that host), easing the load on the server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
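The exponential smoothing formula above can be sketched as follows (illustrative class and parameter values, not the draft patch; t is the server's last observed response time and k scales it into a target delay):

```java
public class AdaptiveDelay {
    private final double alpha;  // 0 < alpha <= 1: higher = smoother
    private final double k;      // scales response time into a delay
    private double delay;

    public AdaptiveDelay(double alpha, double k, double initialDelay) {
        this.alpha = alpha;
        this.k = k;
        this.delay = initialDelay;
    }

    // delay = alpha * delay + (1 - alpha) * (k * t)
    public double update(double t) {
        delay = alpha * delay + (1 - alpha) * (k * t);
        return delay;
    }
}
```

With alpha close to 1, a single slow response barely moves the delay; with alpha close to 0, the fetcher reacts almost immediately to the latest reply time.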
[jira] Created: (NUTCH-471) Fix synchronization in NutchBean creation
Fix synchronization in NutchBean creation - Key: NUTCH-471 URL: https://issues.apache.org/jira/browse/NUTCH-471 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Reporter: Enis Soztutar Fix For: 1.0.0 NutchBean is created and then cached in servlet context. But NutchBean.get(ServletContext app, Configuration conf) is not syncronized, which causes more than one instance of the bean (and DistributedSearch$Client) if servlet container is accessed rapidly during startup. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-471) Fix synchronization in NutchBean creation
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-471: Attachment: NutchBeanCreationSync_v1.patch This patch synchronizes NutchBean.get(ServletContext app, Configuration conf) using the servlet context as the mutex. (NutchBean)app.getAttribute("nutchBean") is checked twice; the first check is not synchronized for performance reasons. Fix synchronization in NutchBean creation - Key: NUTCH-471 URL: https://issues.apache.org/jira/browse/NUTCH-471 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Reporter: Enis Soztutar Fix For: 1.0.0 Attachments: NutchBeanCreationSync_v1.patch NutchBean is created and then cached in servlet context. But NutchBean.get(ServletContext app, Configuration conf) is not synchronized, which causes more than one instance of the bean (and DistributedSearch$Client) if the servlet container is accessed rapidly during startup. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491313 ] Enis Soztutar commented on NUTCH-471: - Nice trick with the unsynchronized check. :) Wow, indeed I have used a pattern w/o knowing about it :) It seemed a simple and efficient solution to me. Isn't the DCL declared to be broken? After reading http://en.wikipedia.org/wiki/Double-checked_locking, I can say that this is a very subtle bug. As suggested, we can fix it by declaring NutchBean volatile. However, I guess that in that case the servlet container would also have to be configured to use Java 1.5 instead of 1.4. Fix synchronization in NutchBean creation - Key: NUTCH-471 URL: https://issues.apache.org/jira/browse/NUTCH-471 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 1.0.0 Reporter: Enis Soztutar Fix For: 1.0.0 Attachments: NutchBeanCreationSync_v1.patch NutchBean is created and then cached in servlet context. But NutchBean.get(ServletContext app, Configuration conf) is not synchronized, which causes more than one instance of the bean (and DistributedSearch$Client) if the servlet container is accessed rapidly during startup. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
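The volatile fix mentioned above can be sketched like this (NutchBean is stood in for by Object; under the Java 5 / JSR-133 memory model, double-checked locking is only correct when the field is volatile, and it remains broken on 1.4):

```java
public class BeanHolder {
    // volatile is what makes double-checked locking safe on Java 5+
    private static volatile Object bean;

    public static Object get() {
        Object b = bean;
        if (b == null) {                    // first, unsynchronized check
            synchronized (BeanHolder.class) {
                b = bean;
                if (b == null) {            // second check, under the lock
                    bean = b = new Object(); // stands in for new NutchBean(conf)
                }
            }
        }
        return b;
    }
}
```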
[jira] Commented: (NUTCH-466) Flexible segment format
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485977 ] Enis Soztutar commented on NUTCH-466: - This patch will indeed resolve many issues related to storing extra information about the crawl. IMO MapFiles will do the job. The Searcher API can be extended with an interface with a method like <E extends Writable, T extends Writable> E getInfo(T key); The implementing class should have a map of Class to MapFiles. Flexible segment format --- Key: NUTCH-466 URL: https://issues.apache.org/jira/browse/NUTCH-466 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assigned To: Andrzej Bialecki In many situations it is necessary to store more data associated with pages than is possible now with the current segment format. Quite often it is binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData; the other is to use an external independent database using page ID-s as foreign keys. Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment parts, with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios. Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts. Example applications: * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents * storing pre-tokenized versions of plain text for faster snippet generation * storing linguistically tagged text for sophisticated data mining * storing image thumbnails etc, etc ... 
I'm going to prepare a patchset shortly. Any comments and suggestions are welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-466) Flexible segment format
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485996 ] Enis Soztutar commented on NUTCH-466: - There may be many parts that use the same key/value classes in MapFiles. Yes, indeed you are right; I hadn't thought about several parts having the same classes. I think the API should select the part by name (String) or some other ID, with a map of byte ID-s to directory names. I thought the map would be from class names to directory names. I think we should use the plugin model, with a registry of segment parts that are active for the current configuration. Do you think that we should also move HitDetailer, HitSummarizer, HitContent and Searcher to this plugin system? And should we break up the multiple functionality in NutchBean and DistributedSearch$Client, and allow for separate index and segment servers? Flexible segment format --- Key: NUTCH-466 URL: https://issues.apache.org/jira/browse/NUTCH-466 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assigned To: Andrzej Bialecki In many situations it is necessary to store more data associated with pages than is possible now with the current segment format. Quite often it is binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData; the other is to use an external independent database using page ID-s as foreign keys. Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment parts, with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios. 
Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts. Example applications: * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents * storing pre-tokenized version of plain text for faster snippet generation * storing linguistically tagged text for sophisticated data mining * storing image thumbnails etc, etc ... I'm going to prepare a patchset shortly. Any comments and suggestions are welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
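To illustrate the part-by-name lookup being discussed, here is a hypothetical sketch in which plain maps stand in for the MapFile-backed segment parts (all names are illustrative, not the eventual Nutch API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical segment-part registry: arbitrarily named parts, each mapping a
// page key to a value. In Nutch each part would be a MapFile of Writables.
class SegmentParts {
    private final Map<String, Map<String, byte[]>> parts = new HashMap<>();

    void put(String part, String key, byte[] value) {
        parts.computeIfAbsent(part, p -> new HashMap<>()).put(key, value);
    }

    // Selecting the part by name avoids the ambiguity of keying on the
    // key/value classes, since several parts may share the same classes.
    byte[] getInfo(String part, String key) {
        Map<String, byte[]> p = parts.get(part);
        return p == null ? null : p.get(key);
    }
}
```

The name-keyed lookup is what makes parts like "thumbnails" and "previews" coexist even when both store the same Writable types.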
[jira] Commented: (NUTCH-455) dedup on tokenized fields is faulty
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479262 ] Enis Soztutar commented on NUTCH-455: - (from LUCENE-252) In Nutch we have 3 options: the 1st is to disallow deleting duplicates on tokenized fields (due to FieldCache), the 2nd is to index the tokenized field twice (once tokenized, and once untokenized), the 3rd is to use LUCENE-252 and the above patch and warm the cache initially in the index servers. I am in favor of the 3rd option. I think first resolving LUCENE-252, and then proceeding with NUTCH-255, is more sensible. dedup on tokenized fields is faulty --- Key: NUTCH-455 URL: https://issues.apache.org/jira/browse/NUTCH-455 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Fix For: 0.9.0 Attachments: IndexSearcherCacheWarm.patch (From LUCENE-252) Nutch uses several index servers, and the search results from these servers are merged using a dedup field for deleting duplicates. The values from this field are cached by Lucene's FieldCacheImpl. The default is the site field, which is indexed and tokenized. However, for a tokenized field (for example url in Nutch), FieldCacheImpl returns an array of Terms rather than an array of field values, so dedup'ing becomes faulty. The current FieldCache implementation does not respect tokenized fields and, as described above, caches only terms. So when we are searching using url as the dedup field, when a Hit is constructed in IndexSearcher, the dedupValue becomes a token of the url (such as www or com) rather than the whole url. This prevents using tokenized fields as the dedup field. I have written a patch for Lucene and attached it to http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the aforementioned issue about tokenized field caching. However, building such a cache for about 1.5M documents takes 20+ secs. 
The code in IndexSearcher.translateHits() starts with if (dedupField != null) dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField); and the cache is built on the first call of search in IndexSearcher. Long story short, I have written a patch against IndexSearcher which, in its constructor, warms up the caches of the wanted fields (configurable). I think we should vote for LUCENE-252, and then commit the above patch with the latest version of Lucene. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-455) dedup on tokenized fields is faulty
dedup on tokenized fields is faulty --- Key: NUTCH-455 URL: https://issues.apache.org/jira/browse/NUTCH-455 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Fix For: 0.9.0 (From LUCENE-252) Nutch uses several index servers, and the search results from these servers are merged using a dedup field for deleting duplicates. The values from this field are cached by Lucene's FieldCacheImpl. The default is the site field, which is indexed and tokenized. However, for a tokenized field (for example url in Nutch), FieldCacheImpl returns an array of Terms rather than an array of field values, so dedup'ing becomes faulty. The current FieldCache implementation does not respect tokenized fields and, as described above, caches only terms. So when we are searching using url as the dedup field, when a Hit is constructed in IndexSearcher, the dedupValue becomes a token of the url (such as www or com) rather than the whole url. This prevents using tokenized fields as the dedup field. I have written a patch for Lucene and attached it to http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the aforementioned issue about tokenized field caching. However, building such a cache for about 1.5M documents takes 20+ secs. The code in IndexSearcher.translateHits() starts with if (dedupField != null) dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField); and the cache is built on the first call of search in IndexSearcher. Long story short, I have written a patch against IndexSearcher which, in its constructor, warms up the caches of the wanted fields (configurable). I think we should vote for LUCENE-252, and then commit the above patch with the latest version of Lucene. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
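The warm-up idea can be sketched as follows; this is an illustrative analogue, not the attached patch, with plain arrays standing in for Lucene's FieldCache:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: eagerly build the per-field value cache in the constructor so the
// first search does not pay the 20+ second cache-building cost.
class WarmedSearcher {
    private final Map<String, String[]> cache = new HashMap<>();

    // fieldValues maps a field name to its per-document values;
    // warmFields lists the (configurable) fields to pre-load.
    WarmedSearcher(Map<String, String[]> fieldValues, String... warmFields) {
        for (String f : warmFields) {
            cache.put(f, fieldValues.get(f));   // built once, up front
        }
    }

    // Returns the whole field value for a document, e.g. the full url,
    // not a single token of it such as "www" or "com".
    String dedupValue(String field, int docId) {
        return cache.get(field)[docId];
    }
}
```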
[jira] Updated: (NUTCH-445) Domain Indexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-445: Attachment: index_query_domain_v1.2.patch This patch is an update of the previous three patches. The patch: 1. contains TranslatingRawFieldQueryFilter as an abstract implementation for searching certain fields in the index with a different query field name. 2. index-basic indexes the domain and all super domains in the domain field. 3. query-site is changed so that site:site_name will search domain:site_name. With this plugin we can search site:apache.org and get results from http://issues.apache.org, etc., or we can search site:com to retrieve all .com domains. Domain Indexing / Query Filter -- Key: NUTCH-445 URL: https://issues.apache.org/jira/browse/NUTCH-445 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: index_query_domain_v1.0.patch, index_query_domain_v1.1.patch, index_query_domain_v1.2.patch, TranslatingRawFieldQueryFilter_v1.0.patch Hostnames contain information about the domain of the host, and all of the subdomains. Indexing and searching the domains is important for intuitive behavior. From the DomainIndexingFilter javadoc: Adds the domain (hostname) and all super domains to the index. For http://lucene.apache.org/nutch/ the following will be added to the index: lucene.apache.org, apache.org, org. All hostnames are domain names, but not all domain names are hostnames. In the above example the hostname lucene.apache.org is a subdomain of apache.org, which is itself a subdomain of org. Currently the basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However, site:apache.org will not return http://lucene.apache.org. By indexing the domain, we are able to search domains. 
Unlike searching the site field (indexed by BasicIndexingFilter), searching the domain field allows us to retrieve lucene.apache.org for the query apache.org. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
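The super-domain expansion described in the javadoc can be sketched like this (an illustrative re-implementation, not the patch itself):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: emit the hostname and every super domain, so that indexing
// lucene.apache.org also indexes apache.org and org in the domain field.
class SuperDomains {
    static List<String> expand(String host) {
        List<String> out = new ArrayList<>();
        out.add(host);
        int dot;
        while ((dot = host.indexOf('.')) != -1) {
            host = host.substring(dot + 1);  // strip the leftmost label
            out.add(host);
        }
        return out;
    }
}
```

With all of these values in the domain field, the query domain:apache.org matches lucene.apache.org, and domain:com matches every .com host.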
[jira] Created: (NUTCH-445) Domain Indexing / Query Filter
Domain Indexing / Query Filter -- Key: NUTCH-445 URL: https://issues.apache.org/jira/browse/NUTCH-445 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Hostnames contain information about the domain of the host, and all of the subdomains. Indexing and searching the domains is important for intuitive behavior. From the DomainIndexingFilter javadoc: Adds the domain (hostname) and all super domains to the index. For http://lucene.apache.org/nutch/ the following will be added to the index: lucene.apache.org, apache.org, org. All hostnames are domain names, but not all domain names are hostnames. In the above example the hostname lucene.apache.org is a subdomain of apache.org, which is itself a subdomain of org. Currently the basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However, site:apache.org will not return http://lucene.apache.org. By indexing the domain, we are able to search domains. Unlike searching the site field (indexed by BasicIndexingFilter), searching the domain field allows us to retrieve lucene.apache.org for the query apache.org. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-445) Domain Indexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-445: Attachment: index_query_domain_v1.0.patch Patch for the index-domain and query-domain plugins. Domain Indexing / Query Filter -- Key: NUTCH-445 URL: https://issues.apache.org/jira/browse/NUTCH-445 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: index_query_domain_v1.0.patch Hostnames contain information about the domain of the host, and all of the subdomains. Indexing and searching the domains is important for intuitive behavior. From the DomainIndexingFilter javadoc: Adds the domain (hostname) and all super domains to the index. For http://lucene.apache.org/nutch/ the following will be added to the index: lucene.apache.org, apache.org, org. All hostnames are domain names, but not all domain names are hostnames. In the above example the hostname lucene.apache.org is a subdomain of apache.org, which is itself a subdomain of org. Currently the basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However, site:apache.org will not return http://lucene.apache.org. By indexing the domain, we are able to search domains. Unlike searching the site field (indexed by BasicIndexingFilter), searching the domain field allows us to retrieve lucene.apache.org for the query apache.org. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-445) Domain Indexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-445: Attachment: TranslatingRawFieldQueryFilter_v1.0.patch This patch complements index_query_domain_v1.0.patch. However, the class TranslatingRawFieldQueryFilter can be used independently, so I have put it in a separate file. The javadoc reads: Similar to {@link RawFieldQueryFilter} except that the index and query field names can be different. This class can be extended by QueryFilters to allow searching a field in the index, but using another field name in the search. For example, index field names can be kept in English, such as content, lang, title, ..., however query filters can be built in other languages. Domain Indexing / Query Filter -- Key: NUTCH-445 URL: https://issues.apache.org/jira/browse/NUTCH-445 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: index_query_domain_v1.0.patch, TranslatingRawFieldQueryFilter_v1.0.patch Hostnames contain information about the domain of the host, and all of the subdomains. Indexing and searching the domains is important for intuitive behavior. From the DomainIndexingFilter javadoc: Adds the domain (hostname) and all super domains to the index. For http://lucene.apache.org/nutch/ the following will be added to the index: lucene.apache.org, apache.org, org. All hostnames are domain names, but not all domain names are hostnames. In the above example the hostname lucene.apache.org is a subdomain of apache.org, which is itself a subdomain of org. Currently the basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However, site:apache.org will not return http://lucene.apache.org. By indexing the domain, we are able to search domains. 
Unlike searching the site field (indexed by BasicIndexingFilter), searching the domain field allows us to retrieve lucene.apache.org for the query apache.org. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-439: Attachment: tld_plugin_v1.1.patch I accidentally forgot to unset http.agent.name in v1.0. This version is the same, except that the agent name is not set. This patch obsoletes v1.0. Top Level Domains Indexing / Scoring Key: NUTCH-439 URL: https://issues.apache.org/jira/browse/NUTCH-439 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch Top Level Domains (TLDs) are the last part(s) of the host name in the DNS system. TLDs are managed by the Internet Assigned Numbers Authority. IANA divides TLDs into three groups: infrastructure, generic (such as com, edu) and country code TLDs (such as en, de, tr). Indexing the top level domain, and optionally boosting it, is needed for improving the search results and enhancing locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-439) Top Level Domains Indexing / Scoring
Top Level Domains Indexing / Scoring Key: NUTCH-439 URL: https://issues.apache.org/jira/browse/NUTCH-439 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Top Level Domains (TLDs) are the last part(s) of the host name in the DNS system. TLDs are managed by the Internet Assigned Numbers Authority. IANA divides TLDs into three groups: infrastructure, generic (such as com, edu) and country code TLDs (such as en, de, tr). Indexing the top level domain, and optionally boosting it, is needed for improving the search results and enhancing locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-439: Attachment: tld_plugin_v1.0.patch This is a plugin implementation for indexing and scoring top level domains in Nutch. TLDs are stored in the TLDEntry class, which has domain, status and boost fields. The TLDs are read from an XML file; there is also an XSD for validation. TLDIndexingFilter implements the IndexingFilter interface to index the domain extensions (such as net, org, en, de) in the tld field. TLDScoringFilter implements the ScoringFilter interface. Basically, this filter multiplies the initial boost (coming from another scoring filter such as OPIC) by the boost of the domain. This way, by configuring the boost of, say, edu domains to 1.1, documents from educational sites are boosted by 1.1 in the index. Also, local search engines may wish to boost the domains hosted in that country; for example, boosting de domains a little in a German search engine seems reasonable. An alternative usage may be to lower the boosts of domains such as biz or info, which are known to have lots of spam. Users can also query the tld field for advanced search. Implementation notes: 1. OpicScoringFilter is changed to respect ScoringFilter chaining. 2. Some second level domains such as co.uk are not recognized, but edu.uk is recognized. Top Level Domains Indexing / Scoring Key: NUTCH-439 URL: https://issues.apache.org/jira/browse/NUTCH-439 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: tld_plugin_v1.0.patch Top Level Domains (TLDs) are the last part(s) of the host name in the DNS system. TLDs are managed by the Internet Assigned Numbers Authority. IANA divides TLDs into three groups: infrastructure, generic (such as com, edu) and country code TLDs (such as en, de, tr). 
Indexing the top level domain, and optionally boosting it, is needed for improving the search results and enhancing locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
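The boost-multiplication behaviour attributed to TLDScoringFilter above can be sketched as follows, with a plain map standing in for the XML-configured TLDEntry table (names hypothetical):

```java
import java.util.Map;

// Sketch: multiply the boost coming from an earlier scoring filter
// (e.g. OPIC) by the configured boost of the page's top level domain.
class TldScoring {
    private final Map<String, Float> tldBoosts;

    TldScoring(Map<String, Float> tldBoosts) {
        this.tldBoosts = tldBoosts;
    }

    float score(String host, float initialBoost) {
        // Note: naive last-label extraction; second level domains like co.uk
        // would need the TLD table, as the implementation note points out.
        String tld = host.substring(host.lastIndexOf('.') + 1);
        return initialBoost * tldBoosts.getOrDefault(tld, 1.0f);
    }
}
```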
[jira] Updated: (NUTCH-251) Administration GUI
[ http://issues.apache.org/jira/browse/NUTCH-251?page=all ] Enis Soztutar updated NUTCH-251: Attachment: Nutch-251-AdminGUI.tar.gz I have updated the patch written by Stephan. This version works with Nutch-0.9-dev and hadoop-0.7.1 (the current version of nutch so far). First extract the tar.gz file into the root of nutch. It should copy src/plugin/admin-*, lib/xalan.jar, lib/serializer.jar, lib/hadoop-0.7.2-dev.jar, hadoop_0.7.1_nutch_gui_v2.patch and nutch_0.9-dev_gui_v2.patch. Then patch nutch with patch -p0 < nutch_0.9-dev_gui_v2.patch (you can test the patch first by running: patch -p0 --dry-run < nutch_0.9-dev_gui_v2.patch). Patched hadoop is included in the archive, but if you wish you can patch hadoop using patch -p0 < hadoop_0.7.1_nutch_gui_v2.patch I have: converted the necessary java.io.File fields and arguments to org.apache.hadoop.fs.Path, replaced deprecated LogFormatter's with LogFactory's, used generics with collections (changed only those I've seen), and written PathSerializable, which implements the Serializable interface (needed for scheduling), plus some hadoop changes and some changes due to hadoop conflicts. I have not tested every feature of this plugin, so there may still be some bugs. Administration GUI -- Key: NUTCH-251 URL: http://issues.apache.org/jira/browse/NUTCH-251 Project: Nutch Issue Type: Improvement Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Minor Fix For: 0.9.0 Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, nutch_gui_plugins_v1.zip, nutch_gui_v1.patch Having a web based administration interface would help to make nutch administration and management much more user friendly. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Enis Soztutar updated NUTCH-289: Attachment: ipInCrawlDatumDraftV5.1.patch The version 5 patch does not run on the current build, so I have fixed it and resent the patch (did not change any code). I think this patch should be included in the trunk. CrawlDatum should store IP address -- Key: NUTCH-289 URL: http://issues.apache.org/jira/browse/NUTCH-289 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8 Reporter: Doug Cutting Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, ipInCrawlDatumDraftV5.patch If the CrawlDatum stored the IP address of the host of its URL, then one could: - partition fetch lists on the basis of IP address, for better politeness; - truncate pages to fetch per IP address, rather than just hostname. This would be a good way to limit the impact of domain spammers. The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
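The fetch-list partitioning benefit described above can be sketched as follows; the host-to-IP map is supplied here, where Nutch would resolve it via DNS when the CrawlDatum is created or during CrawlDB update (names hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: group fetch-list hosts by resolved IP so that politeness limits
// and per-IP truncation apply to the server, not just the hostname. This
// limits domain spammers that map many hostnames to one machine.
class IpPartitioner {
    static Map<String, List<String>> partition(List<String> hosts,
                                               Map<String, String> hostToIp) {
        Map<String, List<String>> byIp = new HashMap<>();
        for (String h : hosts) {
            byIp.computeIfAbsent(hostToIp.get(h), ip -> new ArrayList<>()).add(h);
        }
        return byIp;
    }
}
```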
[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ] Enis Soztutar updated NUTCH-389: Attachment: urlTokenizer-improved.diff This is an improvement and a minor bug fix over the previous url tokenizer. This version first replaces characters that are represented in hexadecimal format in the URLs. For example, the URL file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html will first be converted to file:///tmp/foo baz bar/foo/baz~bar/index.html by replacing the %20 escapes with spaces. A NullPointerException is corrected in the case of the input reader returning null for the URL. Further improvements to URL tokenization can be discussed here. a url tokenizer implementation for tokenizing index fields : url and host - Key: NUTCH-389 URL: http://issues.apache.org/jira/browse/NUTCH-389 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Priority: Minor Attachments: urlTokenizer-improved.diff, urlTokenizer.diff NutchAnalysis.jj tokenizes the input by treating _ (among other characters) as a non-token separator, which is not appropriate in the case of URLs. So I have written a url tokenizer that emits the tokens matching the regular expression [a-zA-Z0-9]+. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression. NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the url, site and host fields. see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters
[ http://issues.apache.org/jira/browse/NUTCH-393?page=comments#action_12447787 ] Enis Soztutar commented on NUTCH-393: - Also, IndexingException is caught by the Indexer, in which case the whole document is not added to the writer (the function returns). Indexer : 334 try { // run indexing filters doc = this.filters.filter(doc, parse, (UTF8)key, fetchDatum, inlinks); } catch (IndexingException e) { if (LOG.isWarnEnabled()) { LOG.warn("Error indexing " + key + ": " + e); } return; } IndexingException should be caught in IndexingFilters.filter(), so that when an IndexingException is thrown in one indexing plugin, the other plugins can still be run. Indexer doesn't handle null documents returned by filters - Key: NUTCH-393 URL: http://issues.apache.org/jira/browse/NUTCH-393 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.8.1 Reporter: Eelco Lempsink Attachments: NUTCH-393.patch Plugins (like IndexingFilter) may return a null value, but this isn't handled by the Indexer. A trivial adjustment is all it takes: @@ -237,6 +237,7 @@ if (LOG.isWarnEnabled()) { LOG.warn("Error indexing " + key + ": " + e); } return; } +if (doc == null) return; float boost = 1.0f; // run scoring filters -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
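The suggestion, catching the exception inside the filter chain and letting the caller skip only null documents, might look like this hypothetical sketch (not the actual Nutch code):

```java
import java.util.List;

// Sketch: run every indexing filter even if one throws; a filter that
// returns null vetoes the document, which the caller must check for.
class FilterChain {
    interface Filter {
        String filter(String doc) throws Exception;
    }

    static String filterAll(List<Filter> filters, String doc) {
        for (Filter f : filters) {
            if (doc == null) {
                return null;            // document vetoed by a previous filter
            }
            try {
                doc = f.filter(doc);
            } catch (Exception e) {
                // log and keep going: one failing plugin no longer
                // prevents the remaining plugins from running
            }
        }
        return doc;
    }
}
```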
[jira] Commented: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ http://issues.apache.org/jira/browse/NUTCH-389?page=comments#action_12445512 ] Enis Soztutar commented on NUTCH-389: - Otis, you can test the tokenizer using the TestUrlTokenizer JUnit test case, and you can test the NutchDocumentTokenizer by running its main method. NutchDocumentTokenizer tokenizes http://www.foo_bar.com/baz_bar?cardar_mar as http www foo_bar com baz_bar cardar_mar whereas the url tokenizer tokenizes the above URL as http www foo bar com baz bar car dar mar so it will hit the queries baz, bar, car, dar and mar as well. For the url http://www.google.com.tr/firefox?client=firefox-arls=org.mozilla:en-US:official NutchDocumentTokenizer gives the tokens: http www google com tr firefox client firefox arls org mozilla en us official while the url tokenizer gives the tokens: http www google com tr firefox client firefox a rls org mozilla en US official a url tokenizer implementation for tokenizing index fields : url and host - Key: NUTCH-389 URL: http://issues.apache.org/jira/browse/NUTCH-389 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Priority: Minor Attachments: urlTokenizer.diff NutchAnalysis.jj tokenizes the input by treating _ (among other characters) as a non-token separator, which is not appropriate in the case of URLs. So I have written a url tokenizer that emits the tokens matching the regular expression [a-zA-Z0-9]+. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression. NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the url, site and host fields. see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
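The tokenization behaviour described above can be sketched like this (an illustrative re-implementation combining the [a-zA-Z0-9]+ rule with the percent-escape decoding from the improved patch; names are hypothetical):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: decode percent-escapes (e.g. %20) first, then emit the runs
// matching [a-zA-Z0-9]+ so '.', '-' and '_' all separate tokens.
class UrlTokens {
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    static List<String> tokenize(String url) {
        String decoded = URLDecoder.decode(url, StandardCharsets.UTF_8);
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(decoded);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

Note that URLDecoder also maps '+' to a space, which is harmless for tokenization on [a-zA-Z0-9]+.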
[jira] Created: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
a url tokenizer implementation for tokenizing index fields : url and host -- Key: NUTCH-389 URL: http://issues.apache.org/jira/browse/NUTCH-389 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Priority: Minor NutchAnalysis.jj tokenizes the input by treating _ (among other characters) as a non-token separator, which is not appropriate in the case of URLs. So I have written a url tokenizer that emits the tokens matching the regular expression [a-zA-Z0-9]+. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression. see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ] Enis Soztutar updated NUTCH-389: Attachment: urlTokenizer.diff Patch for url tokenization. a url tokenizer implementation for tokenizing index fields : url and host - Key: NUTCH-389 URL: http://issues.apache.org/jira/browse/NUTCH-389 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Priority: Minor Attachments: urlTokenizer.diff NutchAnalysis.jj tokenizes the input by treating _ (among other characters) as a non-token separator, which is not appropriate in the case of URLs. So I have written a url tokenizer that emits the tokens matching the regular expression [a-zA-Z0-9]+. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression. NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the url, site and host fields. see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ] Enis Soztutar updated NUTCH-389: Description: NutchAnalysis.jj tokenizes the input by treating _ (among other characters) as a non-token separator, which is not appropriate in the case of urls. So I have written a url tokenizer that emits the tokens matching the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression. NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the url, site and host fields. see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html was: NutchAnalysis.jj tokenizes the input by treating _ (among other characters) as a non-token separator, which is not appropriate in the case of urls. So I have written a url tokenizer that emits the tokens matching the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression. see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak
[ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12431548 ] Enis Soztutar commented on NUTCH-356: - I observed strange behaviour when one of the plug-ins could not be included. For example, the plugin system fails to load plugins when there is a circular dependency among them, or when the name of a plug-in is misspelled in the configuration. Plugin repository cache can lead to memory leak --- Key: NUTCH-356 URL: http://issues.apache.org/jira/browse/NUTCH-356 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Enrico Triolo Attachments: NutchTest.java, patch.txt While I was trying to solve a problem I reported a while ago (see Nutch-314), I found out that the problem was actually related to the plugin cache used in the class PluginRepository.java. As I said in Nutch-314, I think I somehow 'force' the way nutch is meant to work, since I need to frequently submit new urls and append their contents to the index; I don't (and can't) have an urls.txt file with all the urls I'm going to fetch, but I recreate it each time a new url is submitted. Thus, I think in the majority of cases you won't have problems using nutch as-is, since the problem I found occurs only if nutch is used in a way similar to mine. To simplify your test I'm attaching a class that performs something similar to what I need. It fetches and indexes some sample urls; to avoid webmasters' complaints I left the sample urls list empty, so you should modify the source code and add some urls. Then you only have to run it and watch your memory consumption with top. In my experience I get an OutOfMemoryException after a couple of minutes, but it clearly depends on your heap settings and on the plugins you are using (I'm using 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
The problem is bound to the PluginRepository 'singleton' instance, since it never gets released. It seems that some class maintains a reference to it, and that class is never released since it is cached somewhere in the configuration. So I modified the PluginRepository's 'get' method so that it never uses the cache and always returns a new instance (you can find the patch in the attachment). This way the memory consumption is always stable and I get no OOM anymore. Clearly this is not the solution, since I guess there are performance issues involved, but for the moment it works.
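A possible middle ground between the reporter's workaround (no cache at all, with its performance cost) and an unbounded cache that never releases entries would be a cache with weak keys. The sketch below is an assumption, not the actual Nutch patch: it uses stand-in Configuration and PluginRepository classes to show how a WeakHashMap keyed on the configuration object lets a cached repository become collectible as soon as the configuration itself is no longer referenced.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Hedged sketch (not the actual Nutch code): cache one repository per
// configuration object, with weakly-held keys so an entry can be garbage
// collected once its Configuration is unreachable.
public class RepositoryCacheSketch {
    // Hypothetical stand-ins for Nutch's Configuration and PluginRepository.
    static class Configuration {}
    static class PluginRepository {
        PluginRepository(Configuration conf) {
            // Real code would scan and load plugins for this configuration.
        }
    }

    // WeakHashMap holds keys weakly: when a Configuration is no longer
    // referenced elsewhere, its cache entry becomes eligible for collection.
    private static final Map<Configuration, PluginRepository> CACHE =
            Collections.synchronizedMap(new WeakHashMap<>());

    public static PluginRepository get(Configuration conf) {
        return CACHE.computeIfAbsent(conf, PluginRepository::new);
    }
}
```

With this shape, repeated calls with the same configuration reuse one repository (keeping the caching benefit the reporter's always-new-instance patch gives up), while the leak is avoided because nothing pins the entry once the configuration goes away.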