[jira] Commented: (NUTCH-343) Index MP3 SHA1 hashes

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ] Stefan Groschupf commented on NUTCH-343: Thanks for the contribution, also that your patch has a test. :-) Just a small comment from taking a first look to

some questions

2006-08-18 Thread anton
I propose using nutch 0.8 on several computers with DFS. But I'm worried about nutch's requirements for free HDD space. For example, suppose I have 1) server with job tracker and namenode 2) 5 servers with task trackers and 20 GB HDDs 3) 5 servers with datanode and 20 GB HDDs also

[jira] Updated: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ] Stefan Groschupf updated NUTCH-341: --- Attachment: doNotDeleteTmpIndexMergeDirV1.patch +1. I agree it makes completely no sense to be required to create a tmp folder manually and nutch deletes

[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Attachment: respectFetcherParsePropertyV1.patch Hi Jeremy, thanks for catching this. Attached a fix. Should be easy for a contributor to commit this to

[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Priority: Major (was: Trivial) Fetcher ignores the fetcher.parse value configured in config file

[jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated

2006-08-18 Thread Pascal Beis (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428942 ] Pascal Beis commented on NUTCH-345: --- The DeflateUtils are called by HttpBase in the lib-http plugin, which in turn is called by HttpResponse in the protocol-http
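
For readers following NUTCH-345, here is a minimal sketch of what decoding a deflate-encoded response body can look like with java.util.zip. It is not the DeflateUtils code from the patch, which is not shown in this listing. The class name DeflateSketch and its method names are hypothetical. The fallback to a raw stream is an assumption based on the common observation that some servers send raw deflate data without the zlib wrapper.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not the DeflateUtils class from the NUTCH-345 patch.
final class DeflateSketch {

    // Try the zlib-wrapped format first, then fall back to a raw deflate stream.
    static byte[] inflateLenient(byte[] body) {
        try {
            return inflate(body, false);
        } catch (DataFormatException zlibFailure) {
            try {
                return inflate(body, true);
            } catch (DataFormatException rawFailure) {
                throw new IllegalArgumentException("not a deflate stream", rawFailure);
            }
        }
    }

    private static byte[] inflate(byte[] body, boolean nowrap) throws DataFormatException {
        Inflater inflater = new Inflater(nowrap);
        try {
            inflater.setInput(body);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and nothing left to read: the stream is incomplete.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or dictionary-based stream");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end(); // release native zlib resources
        }
    }

    // Compressor used only to produce sample input for the decoder above.
    static byte[] deflate(byte[] data, boolean nowrap) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        try {
            deflater.setInput(data);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }
}
```

A protocol plugin would call something like inflateLenient only when the response carries a Content-Encoding header naming deflate, and would probably also want to cap the decoded size, which this sketch does not do.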

Re: [jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-18 Thread Andrzej Bialecki
Stefan Groschupf (JIRA) wrote: [ http://issues.apache.org/jira/browse/NUTCH-322?page=all ] Stefan Groschupf resolved NUTCH-322. Resolution: Duplicate duplicate of NUTCH-353 ??? If anything, NUTCH-353 is a duplicate of this issue, as it was

[jira] Reopened: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-18 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ] Andrzej Bialecki reopened NUTCH-322: - Assignee: Andrzej Bialecki Re-opening - this issue is not resolved yet. Fetcher discards ProtocolStatus, doesn't store redirected

[jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated

2006-08-18 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428961 ] Andrzej Bialecki commented on NUTCH-345: - Looks ok to me. Minor addition - protocol-httpclient Http.java and HttpResponse.java should be modified too, to

[jira] Updated: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ] Andrzej Bialecki updated NUTCH-341: Attachment: patch-v2.txt I propose another variant of this patch. This version allows you to run multiple mergers at the same time, with the same

Adding Database Field

2006-08-18 Thread Levent Ulutas
Hi there, I'm from Germany. My english isn't so good. I'm a beginner and i have a Question about Nutch. I want to add a new Field Price (String) in the Database. I need it because i want to search for prices from some products and i want to sort the price in the result. Can someone help

[jira] Commented: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-341?page=comments#action_12429029 ] Sami Siren commented on NUTCH-341: -- +1 for v2 IndexMerger now deletes entire workingdir after completing

[jira] Commented: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-08-18 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429033 ] Chris A. Mattmann commented on NUTCH-338: - Hi Andrzej, A patch is available that you can apply quickly to remove the text parser as an option for pdf.

[jira] Resolved: (NUTCH-347) Build: plugins' Jars not found

2006-08-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-347?page=all ] Sami Siren resolved NUTCH-347. -- Fix Version/s: 0.9.0 Resolution: Fixed Assignee: Sami Siren committed Build: plugins' Jars not found --

[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-08-18 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12429035 ] Chris A. Mattmann commented on NUTCH-258: - Hi Folks, A patch is available on this issue. Has anyone who was experiencing the original problem tried out

[jira] Resolved: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-08-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-338?page=all ] Sami Siren resolved NUTCH-338. -- Resolution: Fixed This is now committed, thank you. The patch was broken, hopefully I got it right. Remove the text parser as an option for parsing PDF files in

[jira] Commented: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-08-18 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429042 ] Chris A. Mattmann commented on NUTCH-338: - Hi Sami, Thanks much. It's weird that it was broken seeing as it was a one line patch, however, I tried it

[jira] Commented: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-08-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429044 ] Sami Siren commented on NUTCH-338: -- yeah, svn diff from commandline is the winner. Remove the text parser as an option for parsing PDF files in parse-plugins.xml

Re: Thoughts on Parser design and dependencies

2006-08-18 Thread Andrzej Bialecki
Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd like to look for ways to generalize this, as neither of these ideas apply

[jira] Closed: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ] Andrzej Bialecki closed NUTCH-341. --- Fix Version/s: 0.8.1 0.9.0 Resolution: Fixed Fixed. Thanks! IndexMerger now deletes entire workingdir after completing

Re: Thoughts on Parser design and dependencies

2006-08-18 Thread Sami Siren
Andrzej Bialecki wrote: Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd like to look for ways to generalize this, as

Re: Thoughts on Parser design and dependencies

2006-08-18 Thread Andrzej Bialecki
Sami Siren wrote: Andrzej Bialecki wrote: Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd like to look for ways to

Re: Thoughts on Parser design and dependencies

2006-08-18 Thread Sami Siren
Andrzej Bialecki wrote: Sami Siren wrote: Andrzej Bialecki wrote: Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd like

Re: Thoughts on Parser design and dependencies

2006-08-18 Thread Andrzej Bialecki
Sami Siren wrote: Original motivation for this was http headers and meta tags, which can have multiple values. Another case is the language identification, where the same key may have multiple values, coming from different sources. Additionally, MapWritable supports any Writable, which is
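
To illustrate the multi-value point made in this message (http headers and meta tags repeating under one name), here is a small sketch of a metadata container that keeps every value added under a key. It is not Nutch's metadata class and does not use MapWritable. The class MultiValuedMetadataSketch and its methods are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container, shown only to illustrate one key holding several values.
final class MultiValuedMetadataSketch {

    // Insertion-ordered so values come back in the order they were seen.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Append a value instead of overwriting the previous one.
    void add(String name, String value) {
        values.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
    }

    // All values recorded for the name, or an empty list if there are none.
    List<String> getValues(String name) {
        List<String> found = values.get(name);
        return found == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(found);
    }

    // Convenience accessor for callers that only expect a single value.
    String getFirst(String name) {
        List<String> found = values.get(name);
        return found == null ? null : found.get(0);
    }
}
```

With this shape, a language detected from an http header and another from a meta tag can both be stored under the same key, and the caller decides which source to trust.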

[jira] Updated: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

2006-08-18 Thread Greg Kim (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ] Greg Kim updated NUTCH-105: --- Attachment: RobotRulesParser.patch This patch will not cache the robots.txt on network errors/delays; currently we cache EMPTY_RULES (allows everything) for a host X on
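
The idea described in this update (do not cache a robots.txt result when the fetch failed for network reasons) can be sketched roughly as follows. This is not the code from RobotRulesParser.patch, which is not shown in this listing. RobotsCacheSketch, RobotsFetcher and rulesFor are hypothetical names, rules are modelled as plain strings to keep the example short, and returning EMPTY_RULES for the failing request itself is an assumption.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache, not the RobotRulesParser from the NUTCH-105 patch.
final class RobotsCacheSketch {

    // Stand-in for the "allows everything" rule set mentioned in the issue.
    static final String EMPTY_RULES = "";

    // Hypothetical fetch hook; throws IOException on network errors or timeouts.
    interface RobotsFetcher {
        String fetch(String host) throws IOException;
    }

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final RobotsFetcher fetcher;

    RobotsCacheSketch(RobotsFetcher fetcher) {
        this.fetcher = fetcher;
    }

    String rulesFor(String host) {
        String cached = cache.get(host);
        if (cached != null) {
            return cached;
        }
        try {
            String rules = fetcher.fetch(host);
            cache.put(host, rules); // only successful fetches are remembered
            return rules;
        } catch (IOException networkError) {
            // Assumption: answer this one request permissively, but leave the
            // cache untouched so the next request retries the fetch.
            return EMPTY_RULES;
        }
    }
}
```

The point of the design is that a transient timeout no longer turns into a permanent "allow everything" entry for the host. A stricter variant could refuse to fetch from the host until robots.txt has been read successfully.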

[jira] Updated: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

2006-08-18 Thread Greg Kim (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ] Greg Kim updated NUTCH-105: --- Affects Version/s: 0.8.1 0.9.0 Network error during robots.txt fetch causes file to be ignored

architecture question/thoughts

2006-08-18 Thread bruce
hi... i'm playing around with an app that parses websites and extracts information, returning certain information to my system. my primary issue has to do with how i might architect the system to place the information into my database. i'm using/testing with mysql. my question has to do with how

[jira] Updated: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-08-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-338?page=all ] Sami Siren updated NUTCH-338: - Fix Version/s: 0.8.1 Remove the text parser as an option for parsing PDF files in parse-plugins.xml