Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer

2013-03-17 Thread kiran chitturi
Congrats Feng. Welcome onboard. On Tue, Mar 12, 2013 at 6:43 PM, lewis john mcgibbney lewi...@apache.org wrote: Hi Everyone, On behalf of the Nutch PMC I would like to announce and welcome Feng Lu on board as PMC and Committer on the project. Amongst others, Feng has been an important part

Re: Parse benchmark/performance

2013-03-17 Thread ytthet
Hi Folks, I found out where the issue was. Just thought it might be useful for others. The performance issue I was facing in parse was due to the regular expression URL filter (the regex-urlfilter plugin) and a funny URL. One of the regular expressions was taking long... very long to process for some
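The pattern described above is classic catastrophic backtracking: a nested greedy quantifier meeting a long repeated-character run in a URL. The sketch below is a hypothetical illustration, not the actual expression from regex-urlfilter.txt; the pattern names and the example URL are made up.

```java
import java.util.regex.Pattern;

public class SafeRegexDemo {
    // A catastrophic version would be "^.*(a+)+b$": nested greedy
    // quantifiers can backtrack exponentially on a long run of 'a's
    // that is not followed by 'b'.
    // The possessive quantifier a++ gives nothing back once it has
    // matched, so a failing match is rejected quickly instead.
    public static final Pattern SAFE = Pattern.compile("^.*(a++)+b$");

    public static void main(String[] args) {
        // A "funny" URL: a long repeated-character run with no final 'b'.
        String url = "http://example.com/" + "a".repeat(40) + "!";
        System.out.println(SAFE.matcher(url).matches()); // prints false, and quickly
    }
}
```

Rewriting hot filter rules with possessive quantifiers (or atomic groups) is one common fix; another is simply removing the rule that never matches anything useful.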

Re: Parse benchmark/performance

2013-03-17 Thread kiran chitturi
Thank you Ye for updating us with your findings. It is best to use the latest version of Nutch, since there are updates and fixes in each version. On Sun, Mar 17, 2013 at 3:48 AM, ytthet yethura.t...@gmail.com wrote: Hi Folks, I found out where the issue was. Just thought it might be useful

Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer

2013-03-17 Thread feng lu
Thanks a lot to everyone for inviting me. I'm a software engineer in China, and I have been using Apache Nutch for three years. In our team, I am mainly responsible for modifying Nutch 1.x to suit the requirements of our database, MongoDB. So I also wrote a simple database abstraction layer to adapt

Re: Understanding fetch MapReduce job counters and logs

2013-03-17 Thread feng lu
Yes, the property is fetcher.timelimit.mins. If you do not set this property, the QueueFeeder will not filter the URLs, and the log output may look like this: QueueFeeder finished: total 36651 records + hit by time limit : 0. Do you use the bin/crawl command script? It sets the time limit for fetching to 180.
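For reference, the property can also be set explicitly in nutch-site.xml rather than relying on the value the bin/crawl script passes in. A minimal sketch (the value shown mirrors the script's 180-minute setting; adjust to taste):

```xml
<property>
  <name>fetcher.timelimit.mins</name>
  <value>180</value>
  <description>Abort fetching after this many minutes; -1 disables the limit.</description>
</property>
```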

Re: Understanding fetch MapReduce job counters and logs

2013-03-17 Thread Amit Sela
I am using bin/crawl - I'll change the timeLimitFetch to something a bit higher. Thanks! On Sun, Mar 17, 2013 at 5:07 PM, feng lu amuseme...@gmail.com wrote: yes, the property is fetcher.timelimit.mins. if you not set this property, the QueueFeeder will not filter the url and log output may

Any plans to make nutch 1.x support solr cloud?

2013-03-17 Thread adfel70
Hi, Are there any plans to make Nutch 1.x support SolrCloud? I'm using Nutch 1.4 and Solr 4.0. So far I've managed to work with this because solrj's CommonsHttpSolrServer somehow works with SolrCloud, though it doesn't exist in solr-4.0. This is inconvenient because CommonsHttpSolrServer gets a

SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.

2013-03-17 Thread neeraj
I am getting the following exception when indexing documents to Solr from Nutch: org.apache.solr.common.SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document. Please let me know how to resolve this. I am using Nutch 1.6 for crawling and
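Solr rejects documents containing characters that are not legal in XML 1.0, which is what the exception above reports. One common workaround, independent of any Nutch patch, is to strip such characters from field values before indexing. A minimal sketch (the class and method names are hypothetical, not part of Nutch):

```java
public class XmlCleaner {
    /**
     * Remove characters that are not legal in XML 1.0 documents.
     * Legal ranges: 0x9, 0xA, 0xD, 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF.
     */
    public static String stripInvalidXml(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) out.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \uFFFF and \u0000 are both illegal in XML 1.0 content.
        String dirty = "hello\uFFFFworld\u0000!";
        System.out.println(stripInvalidXml(dirty)); // prints helloworld!
    }
}
```

Running every string field through such a filter in the indexing path avoids the SolrException at the cost of silently dropping the offending characters.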

Re: Any plans to make nutch 1.x support solr cloud?

2013-03-17 Thread Lewis John Mcgibbney
Hi, You are always encouraged to look at our Jira instance before asking questions. It really helps both you and us solve problems efficiently. Please check out https://issues.apache.org/jira/browse/NUTCH-1377 And comment where you can. When we eventually do the entire out of the box upgrade to

Re: SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.

2013-03-17 Thread feng lu
Hi Neeraj, schema-solr4.xml does not work with Solr 4.1.0. Maybe you can apply this patch [0] and run again. [0] https://issues.apache.org/jira/browse/NUTCH-1486 On Mon, Mar 18, 2013 at 2:34 AM, neeraj neerajbhol...@yahoo.com wrote: I am getting following exception when indexing documents to

Re: SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.

2013-03-17 Thread neeraj
Amuseme, Thanks for the reply. I reviewed the exceptions given on the link, and I am not getting any of those. I have more than 5 million documents crawled and was able to index 120K documents to Solr before this exception occurred for an invalid XML character. I was trying to investigate around

Re: SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.

2013-03-17 Thread feng lu
Yes, NUTCH-1016 already fixed this problem. The property parser.character.encoding.default is used when the EncodingDetector cannot detect the content encoding; it sets the default encoding for the page content. If this fallback is wrong, the parsed content can sometimes end up as unreadable garbage.
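The fallback encoding mentioned above is configurable in nutch-site.xml. A minimal sketch (utf-8 shown as an illustrative value; pick whatever dominates your crawl):

```xml
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>Encoding to assume when detection fails.</description>
</property>
```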

Re: SolrException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.

2013-03-17 Thread feng lu
I am not sure whether this error is caused by the parser.character.encoding.default property. Can you trace this error back to a specific document? Then you could create a test environment and parse/index that document again to see what happens. On Mon, Mar 18, 2013 at 12:17 PM, neeraj