Re: Nutch 2.x : readdb command dump

kiran chitturi Wed, 16 Jan 2013 22:56:53 -0800

On Thu, Jan 17, 2013 at 12:39 AM, Lewis John Mcgibbney <
[email protected]> wrote:


> Hi Kiran,
>
> On Wednesday, January 16, 2013, kiran chitturi <[email protected]>
> wrote:
> >
> > Can i make changes to this parameter ALL_FIELDS and then try to dump the
> > fields based on the user input ? This command might look like
> > './bin/nutch readdb -dump baseUrl $OUTPUT'.
>
> I assume by $OUTPUT you mean the field to pass as a param for the mapper
> job?

Sorry, I was not clear earlier. The command I wrote was a little confusing
and not very explanatory. I mean a command like this:

./bin/nutch readdb -dump -field $FIELD_NAME $OUTPUT_DIR

$FIELD_NAME is the name of the field to dump. baseUrl is the default field,
which can be dumped along with any other requested field, since it is the key
that distinguishes one record from another. $OUTPUT_DIR is the directory the
requested fields are dumped to.

So my question is whether we can set a single field or multiple fields in
the query, rather than all the fields as on line 319 in [0].

[0] -
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup
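
To make the idea concrete, here is a minimal sketch of the field selection I
have in mind. It is not taken from WebTableReader: the class name, the
resolveFields helper and the short KNOWN_FIELDS list are all hypothetical, and
the list only contains field names mentioned in this thread. It also assumes
baseUrl is always added to the selection, and that an empty request falls back
to dumping everything.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides which fields a dump would request.
final class FieldSelection {

    // Key field that is always included so records stay distinguishable.
    static final String KEY_FIELD = "baseUrl";

    // Assumption: a small stand-in for the full field list; only field names
    // mentioned in this thread are listed here.
    static final List<String> KNOWN_FIELDS = Arrays.asList(
        "baseUrl", "parseStatus", "protocolStatus", "content", "text");

    // Returns the fields to query: every known field when nothing is requested,
    // otherwise baseUrl followed by the requested fields, without duplicates.
    static String[] resolveFields(List<String> requested) {
        if (requested.isEmpty()) {
            return KNOWN_FIELDS.toArray(new String[0]);
        }
        Set<String> selected = new LinkedHashSet<>();
        selected.add(KEY_FIELD);
        for (String name : requested) {
            if (!KNOWN_FIELDS.contains(name)) {
                throw new IllegalArgumentException("Unknown field: " + name);
            }
            selected.add(name);
        }
        return selected.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Each command-line argument is treated as one requested field name.
        String[] fields = resolveFields(Arrays.asList(args));
        System.out.println(String.join(",", fields));
    }
}
```

As far as I can tell, the Gora Query interface accepts field names through
setFields, so the array returned here could be passed to the query in place of
the all-fields constant. I have not tried that against a real store yet.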

Thanks,
Kiran.

> > I might need to go and take a look at Gora API as you suggested, if i want
> > a command like
> > './bin/nutch readdb -dump baseUrl -condition parseStatus 2 $OUTPUT' to dump
> > baseUrl's based on the field values.
>
> You would be good to head to user@gora as this kind of querying is a key
> part of Gora functionality.
>
>
> >
> > Do you think something like this is meaningful to implement in Nutch 2.x ?
>
> Most certainly, anything that gives us a mechanism to obtain fine grained
> querying of the webdb can only be a good thing right?
>
> >
> > I feel it's a great thing if Nutch can do this instead of doing
> > out-of-the-box work with the database, since we can use different kinds
> > of databases using Gora.
>
> +1
>
> >
> > Please let me know your suggestions.
> >
> > Thanks,
> > Kiran.
> >
> > [0]
> > http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup
> >
> > On Wed, Jan 16, 2013 at 3:54 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> Hi Kiran,
> >>
> >> For this I think you are looking at diving further into the Gora API and
> >> codebase.
> >> As you can see around line 232 [0], the Query is set and executed based on
> >> the key.
> >> What you wish to do would possibly encompass setting fields via the Gora
> >> Query API. There are some other useful methods in there which you could
> >> use for your specific requirements.
> >> If you find something which you think we could integrate into the
> >> WebTableReader in a more widely applicable manner then by all means
> >> please log a Jira, however I think that writing your own custom class to
> >> cut out all of the stuff you don't need from the existing WebTableReader
> >> may be the best route to take.
> >> Of course this may be wrong for me to say...
> >>
> >> Lewis
> >>
> >> [0]
> >> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup
> >> [1]
> >> http://svn.apache.org/repos/asf/gora/trunk/gora-core/src/main/java/org/apache/gora/query/Query.java
> >>
> >> On Wed, Jan 16, 2013 at 9:35 AM, kiran chitturi
> >> <[email protected]> wrote:
> >>
> >> > If i want to fetch the list of urls based on the value of a field in the
> >> > database (like parseStatus, protocolStatus), are there any direct tricks
> >> > or commands for it rather than dumping the webpage (without content and
> >> > text) and searching inside.
> >> >
> >> > For example a command like './bin/nutch readdb -dump $FIELD_NAME
> >> > $FIELD_VALUE $LOCATION', might be quite useful when trying to look in to
> >> > the database after reading stats of the crawl and trying to figure out
> >> > which urls are under (status_redir_temp, status_redir_perm,
> >> > status_retry, status_gone, status_unfetched, status_fetched).
> >> >
> >> > Are there any tips/tricks when trying to deal with large data and trying
> >> > to dump urls based on parseStatus ?
> >> >
> >> > The documentation here (http://wiki.apache.org/nutch/bin/nutch_readdb)
> >> > might not apply to 2.x series.
> >> >
> >> > A page with commands and examples will be very helpful. Can we try to
> >> > create all new documentation separating 2.x and 1.x series ?
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > --
> >> > Kiran Chitturi
> >> >
> >>
> >>
> >>
> >> --
> >> *Lewis*
> >>
> >
> >
> >
> > --
> > Kiran Chitturi
> >
>
> --
> *Lewis*
>



-- 
Kiran Chitturi
