Re: Nutch 2.x : readdb command dump

kiran chitturi Wed, 16 Jan 2013 13:34:17 -0800

Hi Lewis,

Thanks for your suggestions. I am looking at WebTableReader to make
changes, particularly at line 319 [0]. There, the query fields are set and
the ALL_FIELDS parameter from WebPage is passed.


Can I change this ALL_FIELDS parameter and then dump only the fields the
user asks for? The command might look like './bin/nutch
readdb -dump baseUrl $OUTPUT'.
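
To make the idea concrete, here is a minimal sketch of how a user-supplied
field list could replace ALL_FIELDS. Everything in it is an assumption for
illustration: the FieldSelection class, the select method and the shortened
field list are hypothetical and not taken from WebTableReader. In the real
code, the returned array would presumably be handed to the Gora query in
place of ALL_FIELDS.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: maps a user-supplied field list to the fields a query would request. */
final class FieldSelection {

  // Illustrative stand-in for the WebPage field names; not copied from Nutch.
  static final String[] ALL_FIELDS = {
      "baseUrl", "status", "fetchTime", "parseStatus", "protocolStatus", "content", "text"
  };

  /**
   * Returns the requested fields in the order given (duplicates dropped),
   * or a copy of ALL_FIELDS when no selection is supplied.
   * Unknown names are rejected before any query would be built.
   */
  static String[] select(String commaSeparated) {
    if (commaSeparated == null || commaSeparated.trim().isEmpty()) {
      return ALL_FIELDS.clone();
    }
    Set<String> known = new LinkedHashSet<>(Arrays.asList(ALL_FIELDS));
    Set<String> chosen = new LinkedHashSet<>();
    for (String name : commaSeparated.split(",")) {
      String field = name.trim();
      if (field.isEmpty()) {
        continue; // tolerate stray commas
      }
      if (!known.contains(field)) {
        throw new IllegalArgumentException("Unknown field: " + field);
      }
      chosen.add(field);
    }
    return chosen.isEmpty() ? ALL_FIELDS.clone() : chosen.toArray(new String[0]);
  }

  public static void main(String[] args) {
    // Mirrors './bin/nutch readdb -dump baseUrl $OUTPUT' from the example above.
    System.out.println(Arrays.toString(select("baseUrl")));
  }
}
```

Validating the names up front keeps a typo in the field name from turning
into an empty or confusing dump later on.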

I might need to look at the Gora API, as you suggested, if I want
a command like
'./bin/nutch readdb -dump baseUrl -condition parseStatus 2 $OUTPUT' to dump
baseUrls based on field values.
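
A rough sketch of what such a '-condition' option might involve is below.
It only parses the proposed arguments and filters rows held in memory; the
ConditionalDump class, its Options holder and the row helper are all
hypothetical, and the rows are plain maps rather than WebPage objects. A
real implementation would read rows through Gora instead, and whether the
filter can be pushed down to the store is not something this sketch
answers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed "-dump <field> -condition <field> <value> <output>" form. */
final class ConditionalDump {

  /** Parsed command-line options. */
  static final class Options {
    String dumpField;
    String conditionField; // stays null when no -condition is given
    String conditionValue;
    String output;
  }

  static Options parse(String[] args) {
    Options o = new Options();
    int i = 0;
    while (i < args.length) {
      if ("-dump".equals(args[i]) && i + 1 < args.length) {
        o.dumpField = args[i + 1];
        i += 2;
      } else if ("-condition".equals(args[i]) && i + 2 < args.length) {
        o.conditionField = args[i + 1];
        o.conditionValue = args[i + 2];
        i += 3;
      } else if (!args[i].startsWith("-") && o.output == null) {
        o.output = args[i]; // first bare argument is the output location
        i++;
      } else {
        throw new IllegalArgumentException("Unexpected argument: " + args[i]);
      }
    }
    if (o.dumpField == null || o.output == null) {
      throw new IllegalArgumentException(
          "Usage: -dump <field> [-condition <field> <value>] <output>");
    }
    return o;
  }

  /** Collects the dump field from every row that satisfies the optional condition. */
  static List<String> dump(List<Map<String, String>> rows, Options o) {
    List<String> out = new ArrayList<>();
    for (Map<String, String> row : rows) {
      boolean matches = o.conditionField == null
          || o.conditionValue.equals(row.get(o.conditionField));
      String value = row.get(o.dumpField);
      if (matches && value != null) {
        out.add(value);
      }
    }
    return out;
  }

  /** Builds a row from alternating field names and values (test data only). */
  static Map<String, String> row(String... keyValues) {
    Map<String, String> m = new LinkedHashMap<>();
    for (int i = 0; i + 1 < keyValues.length; i += 2) {
      m.put(keyValues[i], keyValues[i + 1]);
    }
    return m;
  }

  public static void main(String[] args) {
    Options o = parse(new String[] {"-dump", "baseUrl", "-condition", "parseStatus", "2", "out"});
    List<Map<String, String>> rows = new ArrayList<>();
    rows.add(row("baseUrl", "http://a.example/", "parseStatus", "2"));
    rows.add(row("baseUrl", "http://b.example/", "parseStatus", "1"));
    System.out.println(dump(rows, o));
  }
}
```

Comparing values as strings keeps the sketch short; status fields would
likely need a proper mapping between names and codes in practice.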

Do you think something like this would be meaningful to implement in Nutch 2.x?

I feel it would be great if Nutch could do this itself, instead of us
working with the database directly outside Nutch, since we can use
different kinds of databases through Gora.

Please let me know your suggestions.

Thanks,
Kiran.

[0]
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup

On Wed, Jan 16, 2013 at 3:54 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Kiran,
>
> For this I think you are looking at diving further into the Gora API and
> codebase.
> As you can see around line 232 [0], the Query is set and executed based on
> the key.
> What you wish to do would possibly encompass setting fields via the Gora
> Query API. There are some other useful methods in there which you could use
> for your specific requirements.
> If you find something which you think we could integrate into the
> WebTableReader in a more widely applicable manner then by all means please
> log a Jira, however I think that writing your own custom class to cut out
> all of the stuff you don't need from the existing WebTableReader may be the
> best route to take.
> Of course this may be wrong for me to say...
>
> Lewis
>
> [0]
>
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup
> [1]
>
> http://svn.apache.org/repos/asf/gora/trunk/gora-core/src/main/java/org/apache/gora/query/Query.java
>
> On Wed, Jan 16, 2013 at 9:35 AM, kiran chitturi
> <[email protected]> wrote:
>
> > If I want to fetch the list of URLs based on the value of a field in the
> > database (like parseStatus, protocolStatus), are there any direct tricks or
> > commands for it, rather than dumping the webpage (without content and text)
> > and searching inside?
> >
> > For example a command like './bin/nutch readdb -dump $FIELD_NAME
> > $FIELD_VALUE $LOCATION', might be quite useful when trying to look into
> > the database after reading stats of the crawl and trying to figure out
> > which urls are under (status_redir_temp, status_redir_perm, status_retry,
> > status_gone, status_unfetched, status_fetched).
> >
> > Are there any tips/tricks when trying to deal with large data and trying to
> > dump urls based on parseStatus?
> >
> > The documentation here (http://wiki.apache.org/nutch/bin/nutch_readdb)
> > might not apply to 2.x series.
> >
> > A page with commands and examples would be very helpful. Can we try to
> > create all-new documentation that separates the 2.x and 1.x series?
> >
> >
> > Thanks,
> >
> > --
> > Kiran Chitturi
> >
>
>
>
> --
> *Lewis*
>



-- 
Kiran Chitturi
