Re: Nutch 2.x : readdb command dump

Lewis John Mcgibbney Wed, 16 Jan 2013 12:55:12 -0800

Hi Kiran,

For this I think you are looking at diving further into the Gora API and
codebase.
As you can see around line 232 [0], the Query is set and executed based on
the key.
What you wish to do would possibly involve setting fields via the Gora
Query API [1]. There are some other useful methods in there which you could use
for your specific requirements.
If you find something which you think we could integrate into the
WebTableReader in a more widely applicable manner then by all means please
log a Jira. However, I think that writing your own custom class to cut out
all of the stuff you don't need from the existing WebTableReader may be the
best route to take.
Of course, I may be wrong about this...
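
A rough sketch of the kind of stripped-down class I have in mind, with a
plain in-memory map standing in for the web table. The class name StatusDump,
the method urlsWithField and the string-valued status field are all
assumptions made up for illustration; none of this is Nutch or Gora API. A
real version would build a Gora Query, restrict it to the one field it needs
and iterate over the result instead of walking a map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a status-filtered dump; not part of Nutch or Gora. */
public class StatusDump {

    /**
     * Returns the keys (URLs) of every row whose value for 'field' equals 'wanted'.
     * 'table' stands in for the web table: key -> (field name -> value).
     * Assumption: status values are modelled as plain strings here.
     */
    public static List<String> urlsWithField(Map<String, Map<String, String>> table,
                                             String field, String wanted) {
        List<String> urls = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : table.entrySet()) {
            // Only the requested field is read; content and text are never touched.
            String actual = row.getValue().get(field);
            if (wanted.equals(actual)) {
                urls.add(row.getKey());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up rows, purely for demonstration.
        Map<String, Map<String, String>> table = new LinkedHashMap<>();
        table.put("http://example.com/a",
                Collections.singletonMap("protocolStatus", "status_fetched"));
        table.put("http://example.com/b",
                Collections.singletonMap("protocolStatus", "status_gone"));

        for (String url : urlsWithField(table, "protocolStatus", "status_gone")) {
            System.out.println(url);
        }
    }
}
```

The point is only the shape: ask for one field, compare it, and emit the key,
rather than dumping whole pages and searching the output afterwards.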


Lewis

[0]
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup
[1]
http://svn.apache.org/repos/asf/gora/trunk/gora-core/src/main/java/org/apache/gora/query/Query.java

On Wed, Jan 16, 2013 at 9:35 AM, kiran chitturi
<[email protected]> wrote:

> If I want to fetch the list of URLs based on the value of a field in the
> database (like parseStatus, protocolStatus), are there any direct tricks or
> commands for it, rather than dumping the webpage (without content and text)
> and searching inside?
>
> For example, a command like './bin/nutch readdb -dump $FIELD_NAME
> $FIELD_VALUE $LOCATION' might be quite useful when trying to look into
> the database after reading stats of the crawl and trying to figure out
> which URLs are under (status_redir_temp, status_redir_perm, status_retry,
> status_gone, status_unfetched, status_fetched).
>
> Are there any tips/tricks when trying to deal with large data and trying to
> dump URLs based on parseStatus?
>
> The documentation here (http://wiki.apache.org/nutch/bin/nutch_readdb)
> might not apply to 2.x series.
>
> A page with commands and examples would be very helpful. Can we try to
> create all new documentation separating the 2.x and 1.x series?
>
>
> Thanks,
>
> --
> Kiran Chitturi
>



-- 
*Lewis*
