[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-31 Thread Matt MacDonald (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445841#comment-13445841 ] Matt MacDonald commented on NUTCH-1445: --- Hi, I'm attempting to use the

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445849#comment-13445849 ] Ferdy Galema commented on NUTCH-1445: - Hi Matt, Sure we can resolve your issue here.

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445850#comment-13445850 ] Ferdy Galema commented on NUTCH-1445: - (feature requests should be future requests

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-31 Thread Matt MacDonald (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445860#comment-13445860 ] Matt MacDonald commented on NUTCH-1445: --- Ferdy, Thanks for the help. I'll

Need some directions

2012-08-31 Thread Vijith
Hi all, I am new to dev... I am working on NUTCH-1150... I would like to get some directions before I can start... Right now I am going through the Fetcher.java code... I have tried running nutch with a sample site with two different urls redirecting to a common resource. I could not find any

Re: Need some directions

2012-08-31 Thread Vijith
Here is the link to the issue - https://issues.apache.org/jira/browse/NUTCH-1150 On Fri, Aug 31, 2012 at 5:37 PM, Vijith vijithkv...@gmail.com wrote: Hi all, I am new to dev... I am working on NUTCH-1150... I would like to get some directions before I can start... Right now I am going

Some questions regarding NUTCH-1150

2012-08-31 Thread Vijith
Hi all, (Please ignore my previous mail, if any) I am new to dev... I am working on NUTCH-1150... https://issues.apache.org/jira/browse/NUTCH-1150 I would like to get some directions before I can start... Right now I am going through the Fetcher.java code... I have tried running nutch with a

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445871#comment-13445871 ] Ferdy Galema commented on NUTCH-1445: - Ah I got it now. It's definitely a bug. When

[jira] [Created] (NUTCH-1462) Elasticsearch not indexing when type==null in NutchDocument metadata

2012-08-31 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1462: --- Summary: Elasticsearch not indexing when type==null in NutchDocument metadata Key: NUTCH-1462 URL: https://issues.apache.org/jira/browse/NUTCH-1462 Project: Nutch

Re: Some questions regarding NUTCH-1150

2012-08-31 Thread Vijith
I apologize. I was sending to the mailing list without subscribing to it. I found the reply from Lewis (from archive). I will comment directly on the issue. Thanks. On Fri, Aug 31, 2012 at 5:59 PM, Vijith vijithkv...@gmail.com wrote: Hi all, (Please ignore my previous mail, if any) I am new

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-31 Thread Matt MacDonald (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445872#comment-13445872 ] Matt MacDonald commented on NUTCH-1445: --- Great! I was just looking in

[jira] [Updated] (NUTCH-1462) Elasticsearch not indexing when type==null in NutchDocument metadata

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1462: Attachment: nutch-1462.patch Elasticsearch not indexing when type==null in NutchDocument

Re: Some questions regarding NUTCH-1150

2012-08-31 Thread Lewis John Mcgibbney
No hassle, Vijith. Thank you, Lewis On Fri, Aug 31, 2012 at 1:37 PM, Vijith vijithkv...@gmail.com wrote: I apologize. I was sending to the mailing list without subscribing to it. I found the reply from Lewis (from archive). I will comment directly on the issue. Thanks. On Fri, Aug 31, 2012 at

[jira] [Commented] (NUTCH-1150) http.redirect.max can lead to multiple parses of the same url

2012-08-31 Thread Vijith Kumar V (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445874#comment-13445874 ] Vijith Kumar V commented on NUTCH-1150: --- I have tried running nutch with a sample

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445878#comment-13445878 ] Ferdy Galema commented on NUTCH-1445: - Created NUTCH-1462 for a fix. For a quick-fix

[jira] [Closed] (NUTCH-1462) Elasticsearch not indexing when type==null in NutchDocument metadata

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1462. --- Resolution: Fixed committed Elasticsearch not indexing when type==null in

[jira] [Created] (NUTCH-1463) Elasticsearch indexer should wait and check response for last flush

2012-08-31 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1463: --- Summary: Elasticsearch indexer should wait and check response for last flush Key: NUTCH-1463 URL: https://issues.apache.org/jira/browse/NUTCH-1463 Project: Nutch

[jira] [Updated] (NUTCH-1463) Elasticsearch indexer should wait and check response for last flush

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1463: Attachment: nutch-1463.patch Elasticsearch indexer should wait and check response for last

[jira] [Closed] (NUTCH-1463) Elasticsearch indexer should wait and check response for last flush

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1463. --- Resolution: Fixed committed. Elasticsearch indexer should wait and check response

[jira] [Closed] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1448. --- Resolution: Fixed Committed. Redirected urls should be handled more cleanly (more

[jira] [Commented] (NUTCH-1150) http.redirect.max can lead to multiple parses of the same url

2012-08-31 Thread Vijith V (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445903#comment-13445903 ] Vijith V commented on NUTCH-1150: - Here is my setup. Page1 (only seed) has links to Page2

[jira] [Commented] (NUTCH-1150) http.redirect.max can lead to multiple parses of the same url

2012-08-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445907#comment-13445907 ] Markus Jelsma commented on NUTCH-1150: -- Ah, I assume you're doing the parse step

[jira] [Commented] (NUTCH-1150) http.redirect.max can lead to multiple parses of the same url

2012-08-31 Thread Vijith V (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445918#comment-13445918 ] Vijith V commented on NUTCH-1150: - Yes I was doing so. Thanks. I tried with fetcher.parse.

[jira] [Comment Edited] (NUTCH-1150) http.redirect.max can lead to multiple parses of the same url

2012-08-31 Thread Vijith V (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445918#comment-13445918 ] Vijith V edited comment on NUTCH-1150 at 9/1/12 12:33 AM: -- Yes I

Re: Need some directions

2012-08-31 Thread Vijith
I have tried running nutch with a sample site with two different urls redirecting to a common resource. I could not find any clues in hadoop.log showing where the common resource is parsed multiple times. Could someone please explain the exact scenario that creates this bug? And how does this bug

[jira] [Commented] (NUTCH-1100) SolrDedup broken

2012-08-31 Thread Luca Cavanna (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445930#comment-13445930 ] Luca Cavanna commented on NUTCH-1100: - I agree, it would make even more sense to

RE: Need some directions

2012-08-31 Thread Markus Jelsma
-Original message- From:Vijith vijithkv...@gmail.com Sent: Fri 31-Aug-2012 15:44 To: dev@nutch.apache.org Subject: Re: Need some directions I have tried running nutch with a sample site with two different urls redirecting to a common resource. I could not find any clues, from

[jira] [Created] (NUTCH-1464) index-static plugin doesn't allow the colon within the field value

2012-08-31 Thread Luca Cavanna (JIRA)
Luca Cavanna created NUTCH-1464: --- Summary: index-static plugin doesn't allow the colon within the field value Key: NUTCH-1464 URL: https://issues.apache.org/jira/browse/NUTCH-1464 Project: Nutch

[jira] [Updated] (NUTCH-1464) index-static plugin doesn't allow the colon within the field value

2012-08-31 Thread Luca Cavanna (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Cavanna updated NUTCH-1464: Description: If I want to configure a static field with a value containing a colon, the
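
The colon problem reported in NUTCH-1464 can be illustrated with a small Java sketch. It assumes, hypothetically, that each static field is configured as a single `name:value` entry; the class and method names are invented for illustration, and this is not the attached patch. Splitting on the first colon only keeps any later colons inside the value.

```java
public class StaticFieldParseSketch {

    // Hypothetical helper: split a "name:value" entry at the FIRST colon only,
    // so that values such as URLs keep their own colons.
    static String[] parseEntry(String entry) {
        // A limit of 2 means at most two parts: the name and the remainder.
        String[] parts = entry.split(":", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected name:value but got: " + entry);
        }
        return new String[] { parts[0].trim(), parts[1].trim() };
    }

    public static void main(String[] args) {
        String[] field = parseEntry("source:http://example.org/feed");
        System.out.println(field[0] + " -> " + field[1]);
    }
}
```

An unbounded `split(":")` on the same entry would break the value into several pieces, which is one plausible way a colon in the value could go wrong.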

[jira] [Commented] (NUTCH-1464) index-static plugin doesn't allow the colon within the field value

2012-08-31 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445937#comment-13445937 ] Lewis John McGibbney commented on NUTCH-1464: - Nice catch Luca. Do you have a

[jira] [Updated] (NUTCH-1464) index-static plugin doesn't allow the colon within the field value

2012-08-31 Thread Luca Cavanna (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Cavanna updated NUTCH-1464: Attachment: NUTCH-1464.patch I do have a patch, but it's against 1.5 branch. Anyway it's really

JIRA Nutch 968, File Protocol error 404 while fetching files that contains CJK character in the file name

2012-08-31 Thread Ye T Thet
Hi Folks, There is an issue with the protocol-file plugin while fetching files that contain CJK characters in the file name (JIRA Nutch 968). After I checked the code, I discovered that the problem is due to the encoding of the file name while fetching the directory. After changing a couple of lines as

[jira] [Commented] (NUTCH-1100) SolrDedup broken

2012-08-31 Thread Luca Cavanna (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446002#comment-13446002 ] Luca Cavanna commented on NUTCH-1100: - The problem with the approach I mentioned

Re: JIRA Nutch 968, File Protocol error 404 while fetching files that contains CJK character in the file name

2012-08-31 Thread Lewis John Mcgibbney
Hi Ye, Please feel free to comment fully on any issue you find on the Nutch Jira. If you find other/additional bugs or improvements which are not already opened on the Jira instance then please feel free to open new ones once you are sure they are not duplicates and/or can be resolved via the user@

Re: JIRA Nutch 968, File Protocol error 404 while fetching files that contains CJK character in the file name

2012-08-31 Thread Ye T Thet
Thanks for the welcome. The issue is due to the encoding of the file name. To fix it, I needed to make two changes in FileResponse.java in the protocol-file plugin. The fixes were a temporary solution, so I hard-coded the encoding to utf-8. It would be a better idea to read the encoding from the
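
The message does not show the actual changes to FileResponse.java, so the following is only a minimal Java sketch of the general idea: percent-encoding and decoding a file name with an explicit charset rather than the platform default. The class name, method names, and the hard-coded `ENCODING` constant are assumptions made for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FileNameEncodingSketch {

    // Assumption: a hard-coded charset, mirroring the temporary fix described above.
    static final String ENCODING = "UTF-8";

    // Percent-encode a file name so it can be placed in a file: URL.
    static String encodeName(String fileName) {
        try {
            return URLEncoder.encode(fileName, ENCODING);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Decode a percent-encoded name back to the name used on disk.
    // Note: URLDecoder also turns '+' into a space (form-encoding rules).
    static String decodeName(String encodedName) {
        try {
            return URLDecoder.decode(encodedName, ENCODING);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = encodeName("\u6587\u4ef6.txt"); // a two-character CJK name
        System.out.println(encoded + " -> " + decodeName(encoded));
    }
}
```

Reading the charset from configuration instead of the constant would be the natural next step suggested in the message.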

[jira] [Closed] (NUTCH-1431) Introduce link 'distance' and add configurable max distance in the generator

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1431. --- Resolution: Fixed committed Introduce link 'distance' and add configurable max

[jira] [Commented] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

2012-08-31 Thread Christian Johnsson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446380#comment-13446380 ] Christian Johnsson commented on NUTCH-1448: --- Will this affect the outlink and

[jira] [Commented] (NUTCH-1461) Problem with TableUtil

2012-08-31 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446396#comment-13446396 ] Lewis John McGibbney commented on NUTCH-1461: - Hi Christian, you make some

[jira] [Commented] (NUTCH-1461) Problem with TableUtil

2012-08-31 Thread Christian Johnsson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446487#comment-13446487 ] Christian Johnsson commented on NUTCH-1461: --- Sure, this one should do the trick.

[jira] [Updated] (NUTCH-1461) Problem with TableUtil

2012-08-31 Thread Christian Johnsson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Johnsson updated NUTCH-1461: -- Attachment: TabelUtil_Fix.patch Quick fix in case there are some invalid domains in

[jira] [Comment Edited] (NUTCH-1461) Problem with TableUtil

2012-08-31 Thread Christian Johnsson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446487#comment-13446487 ] Christian Johnsson edited comment on NUTCH-1461 at 9/1/12 10:27 AM:

[jira] [Comment Edited] (NUTCH-1461) Problem with TableUtil

2012-08-31 Thread Christian Johnsson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446487#comment-13446487 ] Christian Johnsson edited comment on NUTCH-1461 at 9/1/12 10:43 AM:

[jira] [Commented] (NUTCH-872) Change the default fetcher.parse to FALSE

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446511#comment-13446511 ] Ferdy Galema commented on NUTCH-872: Yes that is correct. Change the

[jira] [Commented] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446515#comment-13446515 ] Ferdy Galema commented on NUTCH-1448: - Yes it does show up as an outlink. About your

[jira] [Commented] (NUTCH-1461) Problem with TableUtil

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446518#comment-13446518 ] Ferdy Galema commented on NUTCH-1461: - Added comment in NUTCH-1448.

[jira] [Commented] (NUTCH-872) Change the default fetcher.parse to FALSE

2012-08-31 Thread Christian Johnsson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446568#comment-13446568 ] Christian Johnsson commented on NUTCH-872: -- I applied the patch and did a test run

[jira] [Commented] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

2012-08-31 Thread Christian Johnsson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446570#comment-13446570 ] Christian Johnsson commented on NUTCH-1448: --- Thank you for the information. Yes

Build failed in Jenkins: Nutch-nutchgora #334

2012-08-31 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/334/changes Changes: [ferdy] NUTCH-1431 Introduce link 'distance' and add configurable max distance in the generator [ferdy] NUTCH-1448 Redirected urls should be handled more cleanly (more like an outlink url) [ferdy] NUTCH-1463 Elasticsearch

[jira] [Commented] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

2012-08-31 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446589#comment-13446589 ] Hudson commented on NUTCH-1448: --- Integrated in Nutch-nutchgora #334 (See

[jira] [Commented] (NUTCH-1462) Elasticsearch not indexing when type==null in NutchDocument metadata

2012-08-31 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446591#comment-13446591 ] Hudson commented on NUTCH-1462: --- Integrated in Nutch-nutchgora #334 (See