[
http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ]
Stefan Groschupf commented on NUTCH-343:
Thanks for the contribution, also that your patch has a test. :-)
Just a small comment from taking a first look to
I suggest to use nutch 0.8 on several computers with DFS. But I'm worried
about nutch's requirements to HDD free space.
For example, suppose I have
1) server with job tracker and namenode
2) 5 servers with task trackers and 20 GB HDDs
3) 5 servers with datanodes and 20 GB HDDs also
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]
Stefan Groschupf updated NUTCH-341:
---
Attachment: doNotDeleteTmpIndexMergeDirV1.patch
+1.
I agree it makes completely no sense to be required to create a tmp folder
manually and nutch deletes
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]
Stefan Groschupf updated NUTCH-337:
---
Attachment: respectFetcherParsePropertyV1.patch
Hi Jeremy, thanks for catching this. Attached a fix. Should be easy for a
contributor to commit this to
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]
Stefan Groschupf updated NUTCH-337:
---
Priority: Major (was: Trivial)
Fetcher ignores the fetcher.parse value configured in config file
[
http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428942 ]
Pascal Beis commented on NUTCH-345:
---
The DeflateUtils are called by HttpBase in the lib-http plugin, which in turn
is called by
HttpResponse in the protocol-http
Stefan Groschupf (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]
Stefan Groschupf resolved NUTCH-322.
Resolution: Duplicate
duplicate of NUTCH-353
??? If anything, NUTCH-353 is a duplicate of this issue, as it was
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]
Andrzej Bialecki reopened NUTCH-322:
-
Assignee: Andrzej Bialecki
Re-opening - this issue is not resolved yet.
Fetcher discards ProtocolStatus, doesn't store redirected
[
http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428961 ]
Andrzej Bialecki commented on NUTCH-345:
-
Looks ok to me. Minor addition - protocol-httpclient Http.java and
HttpResponse.java should be modified too, to
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]
Andrzej Bialecki updated NUTCH-341:
Attachment: patch-v2.txt
I propose another variant of this patch. This version allows you to run
multiple mergers at the same time, with the same
Hi there,
I'm from Germany. My English isn't so good.
I'm a beginner and I have a question about Nutch.
I want to add a new field, Price (String), to the database.
I need it because I want to search for the prices of some products and I want to
sort the results by price.
Can someone help
[
http://issues.apache.org/jira/browse/NUTCH-341?page=comments#action_12429029 ]
Sami Siren commented on NUTCH-341:
--
+1 for v2
IndexMerger now deletes entire workingdir after completing
[
http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429033 ]
Chris A. Mattmann commented on NUTCH-338:
-
Hi Andrzej,
A patch is available that you can apply quickly to remove the text parser as
an option for pdf.
[ http://issues.apache.org/jira/browse/NUTCH-347?page=all ]
Sami Siren resolved NUTCH-347.
--
Fix Version/s: 0.9.0
Resolution: Fixed
Assignee: Sami Siren
committed
Build: plugins' Jars not found
--
[
http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12429035 ]
Chris A. Mattmann commented on NUTCH-258:
-
Hi Folks,
A patch is available on this issue. Has anyone who was experiencing the
original problem tried out
[ http://issues.apache.org/jira/browse/NUTCH-338?page=all ]
Sami Siren resolved NUTCH-338.
--
Resolution: Fixed
This is now committed, thank you.
The patch was broken, hopefully I got it right.
Remove the text parser as an option for parsing PDF files in
[
http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429042 ]
Chris A. Mattmann commented on NUTCH-338:
-
Hi Sami,
Thanks much. It's weird that it was broken seeing as it was a one line patch,
however, I tried it
[
http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429044 ]
Sami Siren commented on NUTCH-338:
--
yeah, svn diff from commandline is the winner.
Remove the text parser as an option for parsing PDF files in parse-plugins.xml
Jukka Zitting wrote:
The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
like to look for ways to generalize this, as neither of these ideas
apply
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]
Andrzej Bialecki closed NUTCH-341.
---
Fix Version/s: 0.8.1
0.9.0
Resolution: Fixed
Fixed. Thanks!
IndexMerger now deletes entire workingdir after completing
Andrzej Bialecki wrote:
Jukka Zitting wrote:
The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
like to look for ways to generalize this, as
Sami Siren wrote:
Andrzej Bialecki wrote:
Jukka Zitting wrote:
The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
like to look for ways to
Andrzej Bialecki wrote:
Sami Siren wrote:
Andrzej Bialecki wrote:
Jukka Zitting wrote:
The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
like
Sami Siren wrote:
Original motivation for this was http headers and meta tags, which
can have multiple values. Another case is the language
identification, where the same key may have multiple values, coming
from different sources. Additionally, MapWritable supports any
Writable, which is
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]
Greg Kim updated NUTCH-105:
---
Attachment: RobotRulesParser.patch
This patch will not cache the robots.txt on network errors/delays; currently
we cache EMPTY_RULES (allows everything) for a host X on
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]
Greg Kim updated NUTCH-105:
---
Affects Version/s: 0.8.1
0.9.0
Network error during robots.txt fetch causes file to be ignored
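
The behaviour described in the NUTCH-105 comment above (do not cache the permissive EMPTY_RULES when the robots.txt fetch fails because of a network error) might be sketched roughly as follows. This is not the attached RobotRulesParser.patch: the class RobotsCacheSketch, the RobotsFetcher interface, the Rules enum and the choice to return EMPTY_RULES for the failed request are all hypothetical, chosen only to illustrate the idea.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative sketch only; every name here is hypothetical, not Nutch API. */
public class RobotsCacheSketch {

    /** Stand-in for a parsed rule set; EMPTY_RULES allows everything. */
    public enum Rules { EMPTY_RULES, PARSED_RULES }

    /** Hypothetical fetcher: returns empty when a network error or delay occurs. */
    public interface RobotsFetcher {
        Optional<Rules> fetch(String host);
    }

    private final Map<String, Rules> cache = new HashMap<>();
    private final RobotsFetcher fetcher;

    public RobotsCacheSketch(RobotsFetcher fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the rules for a host, caching them only after a successful fetch. */
    public Rules rulesFor(String host) {
        Rules cached = cache.get(host);
        if (cached != null) {
            return cached;
        }
        Optional<Rules> fetched = fetcher.fetch(host);
        if (fetched.isPresent()) {
            cache.put(host, fetched.get());
            return fetched.get();
        }
        // Assumption: on a network error, answer this one request permissively
        // but leave the cache untouched so the next request retries the fetch.
        return Rules.EMPTY_RULES;
    }

    /** Reports whether rules for the host are currently cached. */
    public boolean isCached(String host) {
        return cache.containsKey(host);
    }
}
```

With this shape, a transient failure for host X affects only the request that hit it; a later request for the same host triggers a fresh robots.txt fetch instead of reusing a cached allow-everything entry.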
hi...
i'm playing around with an app that parses websites and extracts
information, returning certain information to my system.
my primary issue has to do with how i might architect the system to place
the information into my database. i'm using/testing with mysql. my question
has to do with how
[ http://issues.apache.org/jira/browse/NUTCH-338?page=all ]
Sami Siren updated NUTCH-338:
-
Fix Version/s: 0.8.1
Remove the text parser as an option for parsing PDF files in parse-plugins.xml