Re: Nutch 2.x : readdb command dump

Lewis John Mcgibbney Wed, 16 Jan 2013 21:40:35 -0800

Hi Kiran,

On Wednesday, January 16, 2013, kiran chitturi wrote:
>
> Can I make changes to this parameter ALL_FIELDS and then try to dump the
> fields based on the user input? This command might look like './bin/nutch
> readdb -dump baseUrl $OUTPUT'.


I assume by $OUTPUT you mean the field to pass as a param for the mapper
job?

> I might need to go and take a look at the Gora API as you suggested, if I
> want a command like
> './bin/nutch readdb -dump baseUrl -condition parseStatus 2 $OUTPUT' to dump
> baseUrls based on the field values.

You would do well to head to user@gora, as this kind of querying is a key
part of Gora functionality.
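
To make the requested behaviour concrete, here is a minimal sketch of the
filter logic only, written in plain Java with no Nutch or Gora dependency.
Everything in it is an assumption made for illustration: the Row class, the
integer parseStatus value and the method name baseUrlsWithParseStatus are
hypothetical stand-ins, not classes or methods from WebTableReader or Gora.
A real implementation would iterate over the results of a Gora query rather
than an in-memory map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldFilterSketch {

    /** Hypothetical stand-in for one stored row; NOT Nutch's WebPage class. */
    static final class Row {
        final String baseUrl;
        final int parseStatus; // assumed to be a plain int for this sketch

        Row(String baseUrl, int parseStatus) {
            this.baseUrl = baseUrl;
            this.parseStatus = parseStatus;
        }
    }

    /**
     * Returns the baseUrl of every row whose parseStatus equals the wanted
     * value, in the iteration order of the given map.
     */
    static List<String> baseUrlsWithParseStatus(Map<String, Row> table, int wanted) {
        List<String> matches = new ArrayList<>();
        for (Row row : table.values()) {
            if (row.parseStatus == wanted) {
                matches.add(row.baseUrl);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the web table; keys are arbitrary row ids.
        Map<String, Row> table = new LinkedHashMap<>();
        table.put("row1", new Row("http://example.org/", 2));
        table.put("row2", new Row("http://example.org/missing", 1));
        table.put("row3", new Row("http://example.org/about", 2));

        // Mirrors the idea of "-dump baseUrl -condition parseStatus 2".
        for (String url : baseUrlsWithParseStatus(table, 2)) {
            System.out.println(url);
        }
    }
}
```

Filtering on the client side like this reads every row, so for a large
table it would be preferable to push the condition down into the store
where the backend allows it.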


>
> Do you think something like this is meaningful to implement in Nutch 2.x ?

Most certainly. Anything that gives us a mechanism for fine-grained
querying of the webdb can only be a good thing, right?

>
> I feel it's a great thing if Nutch can do this, rather than us working
> directly with the database outside Nutch, since we can use different kinds
> of databases through Gora.

+1

>
> Please let me know your suggestions.
>
> Thanks,
> Kiran.
>
> [0]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup
>
> On Wed, Jan 16, 2013 at 3:54 PM, Lewis John Mcgibbney wrote:
>
>> Hi Kiran,
>>
>> For this I think you are looking at diving further into the Gora API and
>> codebase.
>> As you can see around line 232 [0], the Query is set and executed based on
>> the key.
>> What you wish to do would possibly encompass setting fields via the Gora
>> Query API. There are some other useful methods in there which you could
>> use for your specific requirements.
>> If you find something which you think we could integrate into the
>> WebTableReader in a more widely applicable manner, then by all means
>> please log a Jira. However, I think that writing your own custom class to
>> cut out all of the stuff you don't need from the existing WebTableReader
>> may be the best route to take.
>> Of course this may be wrong for me to say...
>>
>> Lewis
>>
>> [0]
>> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup
>> [1]
>> http://svn.apache.org/repos/asf/gora/trunk/gora-core/src/main/java/org/apache/gora/query/Query.java
>>
>> On Wed, Jan 16, 2013 at 9:35 AM, kiran chitturi wrote:
>>
>> > If I want to fetch the list of urls based on the value of a field in the
>> > database (like parseStatus, protocolStatus), are there any direct tricks
>> > or commands for it, rather than dumping the webpage (without content and
>> > text) and searching inside?
>> >
>> > For example, a command like './bin/nutch readdb -dump $FIELD_NAME
>> > $FIELD_VALUE $LOCATION' might be quite useful when trying to look into
>> > the database after reading stats of the crawl and trying to figure out
>> > which urls are under (status_redir_temp, status_redir_perm, status_retry,
>> > status_gone, status_unfetched, status_fetched).
>> >
>> > Are there any tips/tricks when trying to deal with large data and trying
>> > to dump urls based on parseStatus?
>> >
>> > The documentation here (http://wiki.apache.org/nutch/bin/nutch_readdb)
>> > might not apply to 2.x series.
>> >
>> > A page with commands and examples will be very helpful. Can we try to
>> > create all new documentation separating 2.x and 1.x series ?
>> >
>> >
>> > Thanks,
>> >
>> > --
>> > Kiran Chitturi
>> >
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
>
> --
> Kiran Chitturi
>

-- 
*Lewis*
