Hi Kiran,

On Wednesday, January 16, 2013, kiran chitturi <[email protected]> wrote:
>
> Can I make changes to this parameter ALL_FIELDS and then try to dump the
> fields based on the user input? This command might look like './bin/nutch
> readdb -dump baseUrl $OUTPUT'.

I assume by $OUTPUT you mean the field to pass as a param for the mapper
job?

> I might need to go and take a look at the Gora API as you suggested, if I
> want a command like
> './bin/nutch readdb -dump baseUrl -condition parseStatus 2 $OUTPUT' to dump
> baseUrls based on the field values.

You would do well to head to user@gora, as this kind of querying is a key
part of Gora functionality.

>
> Do you think something like this is meaningful to implement in Nutch 2.x?

Most certainly. Anything that gives us a mechanism for fine-grained
querying of the webdb can only be a good thing, right?

>
> I feel it's a great thing if Nutch can do this instead of working
> directly with the database, since we can use different kinds of databases
> through Gora.

+1

>
> Please let me know your suggestions.
>
> Thanks,
> Kiran.
>
> [0]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup
>
> On Wed, Jan 16, 2013 at 3:54 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi Kiran,
>>
>> For this I think you are looking at diving further into the Gora API and
>> codebase.
>> As you can see around line 232 [0], the Query is set and executed based on
>> the key.
>> What you wish to do would possibly encompass setting fields via the Gora
>> Query API [1]. There are some other useful methods in there which you
>> could use for your specific requirements.
>> If you find something which you think we could integrate into the
>> WebTableReader in a more widely applicable manner then by all means please
>> log a Jira; however, I think that writing your own custom class to cut out
>> all of the stuff you don't need from the existing WebTableReader may be the
>> best route to take.
>> Of course this may be wrong for me to say...
>>
>> Lewis
>>
>> [0]
>> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java?view=markup
>> [1]
>> http://svn.apache.org/repos/asf/gora/trunk/gora-core/src/main/java/org/apache/gora/query/Query.java
>>
>> On Wed, Jan 16, 2013 at 9:35 AM, kiran chitturi
>> <[email protected]> wrote:
>>
>> > If I want to fetch the list of urls based on the value of a field in the
>> > database (like parseStatus, protocolStatus), are there any direct tricks
>> > or commands for it, rather than dumping the webpage (without content and
>> > text) and searching inside?
>> >
>> > For example a command like './bin/nutch readdb -dump $FIELD_NAME
>> > $FIELD_VALUE $LOCATION' might be quite useful when trying to look into
>> > the database after reading stats of the crawl and trying to figure out
>> > which urls are under (status_redir_temp, status_redir_perm, status_retry,
>> > status_gone, status_unfetched, status_fetched).
>> >
>> > Are there any tips/tricks when trying to deal with large data and trying
>> > to dump urls based on parseStatus?
>> >
>> > The documentation here (http://wiki.apache.org/nutch/bin/nutch_readdb)
>> > might not apply to the 2.x series.
>> >
>> > A page with commands and examples will be very helpful. Can we try to
>> > create all new documentation separating the 2.x and 1.x series?
>> >
>> >
>> > Thanks,
>> >
>> > --
>> > Kiran Chitturi
>> >
>>
>> --
>> *Lewis*
>>
>
> --
> Kiran Chitturi
>

--
*Lewis*
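
For anyone following the thread, here is a minimal sketch of the semantics the proposed './bin/nutch readdb -dump baseUrl -condition parseStatus 2 $OUTPUT' command would have: print one field for every row whose other field matches a value. It is only an illustration under loud assumptions. It does not touch the Gora Query API or the Nutch WebTableReader; rows are plain in-memory maps, and the class name FieldDumpSketch, the helpers row() and dump(), and the example.org URLs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class FieldDumpSketch {

    // Hypothetical stand-in for a stored web page row: field name -> value as text.
    // A real implementation would read these from the webdb through Gora instead.
    static Map<String, String> row(String baseUrl, String parseStatus) {
        Map<String, String> r = new LinkedHashMap<>();
        r.put("baseUrl", baseUrl);
        r.put("parseStatus", parseStatus);
        return r;
    }

    // Collects the value of dumpField for every row whose conditionField
    // equals conditionValue. A null conditionField means "no condition",
    // so every row that has dumpField is included.
    static List<String> dump(List<Map<String, String>> rows, String dumpField,
                             String conditionField, String conditionValue) {
        List<String> out = new ArrayList<>();
        for (Map<String, String> r : rows) {
            boolean matches = conditionField == null
                    || Objects.equals(conditionValue, r.get(conditionField));
            if (matches && r.containsKey(dumpField)) {
                out.add(r.get(dumpField));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data; the status values are arbitrary text here.
        List<Map<String, String>> rows = Arrays.asList(
                row("http://example.org/a", "2"),
                row("http://example.org/b", "1"),
                row("http://example.org/c", "2"));

        // Equivalent in spirit to: -dump baseUrl -condition parseStatus 2
        for (String url : dump(rows, "baseUrl", "parseStatus", "2")) {
            System.out.println(url);
        }
    }
}
```

In a real implementation the filtering would ideally be pushed down to the data store rather than done over rows already loaded into memory, which is why the Gora Query API is the place to look.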

