Which parsers to use with Nutch 1.1?

2010-09-08 Thread Mike Baranczak
The impression that I got from reading the mailing lists is that the developers are slowly moving to deprecate all the parser plugins in favor of Tika - but that this process is not quite finished in the 1.1 release, and that the Tika plugin is still a little wonky. Is this correct? -MB

RE: Subcollection Plugin issue - Branch 1.2

2010-09-08 Thread Nemani, Raj
I think I resolved the issue. The way to set up subcollections.xml is NOT this: stylebook sb http://mysite.mydomain.com/guidance/ http://mysite/guidance/ It needs to be set up the following way: stylebook sb http://cnnlibrary.turner.com/guidance/ http://cnnlibrary/guidance/ Each pa

Re: Cygwin

2010-09-08 Thread Yavuz Selim YILMAZ
As Raj said, I forgot to build the source; you can build it with Ant. -- Yavuz Selim YILMAZ 2010/9/8 Richard Huang > Can u share how you resolve it? Thanks. > > Sent from my iPhone > > On Sep 8, 2010, at 1:33 AM, Yavuz Selim YILMAZ > wrote: > > > Ok Raj, I solved the problem, thnx. >

Re: Mime type via index-more plugin

2010-09-08 Thread Markus Jelsma
I'll try to give it a shot this week for the 1.2 branch and trunk if it isn't too different. It shouldn't be too hard and Julien's explanation on how to read the configuration makes a lot of sense. On Wednesday 08 September 2010 16:37:29 Mattmann, Chris A (388J) wrote: > Hi Markus, > > > Inter

Re: Mime type via index-more plugin

2010-09-08 Thread Mattmann, Chris A (388J)
Hi Markus, > Interesting! But can the mime extractor return more than one type for a given > file in Nutch? Sure, Nutch metadata is a named Field->multi-value structure so a file (or piece of content) can certainly have more than 1 type. > I see, but in that case it would be helpful if the canon

Re: Mime type via index-more plugin

2010-09-08 Thread Markus Jelsma
Hello Chris, On Wednesday 08 September 2010 16:17:30 Mattmann, Chris A (388J) wrote: > Hi Markus, > > In fact, there are plenty of times that files have > 1 mime type. There is > an entire classification scheme from IANA that defines parent-child > relationships between mime type (such as the n

Re: Mime type via index-more plugin

2010-09-08 Thread Mattmann, Chris A (388J)
Hi Markus, In fact, there are plenty of times that files have > 1 mime type. There is an entire classification scheme from IANA that defines parent-child relationships between mime type (such as the notion that text/xml is a descendant of text/plain). The current index-more plugin splits up mi

Re: Dynamic add slave to nutch cluster

2010-09-08 Thread Julien Nioche
The Hadoop user list would be a better place to ask this. 2010/9/8 yi zhu > > I've run a 2-datanode-cluster to do crawling job, now I need to add one new > node to the cluster without stop the cluster > > I add a new line in conf/slaves ,and what should I do next? stop-all.sh and > start-all.sh s

RE: Solr and Nutch

2010-09-08 Thread Thumuluri, Sai
Thank you - We are able to see the meta data on the Nutch front using bin/nutch org.apache.nutch.parse.ParserChecker *, but cannot see the metadata on the Solr side. We have added metadata fields in solrmapping and also checked our schema.xml on both nutch and solr. Are there any additional conf

Re: Solr and Nutch

2010-09-08 Thread André Ricardo
One plugin can add multiple, different fields. In schema.xml you can map your new fields coming from Nutch. But I don't really know about solrmapping.xml. On 10/09/08 07:35, Yavuz Selim YILMAZ wrote: More than one field, then define a new plugin per new metadata? Different pages ha

Re: Compiling Gora to compile Nutch Trunk fails with ANt Runtime issue

2010-09-08 Thread Enis Soztutar
The message you sent is not an error, it is a warning. It should still compile. Please follow the steps at http://github.com/enis/gora Cheers, Enis On Thu, Sep 2, 2010 at 10:16 PM, Nemani, Raj wrote: > All, > > > > I am trying to compile Gora to compile latest Nutch trunk. I am doing > the fo

Re: Nutch 2.0 Help

2010-09-08 Thread Enis Soztutar
Hi, I think we need to commit all the necessary files to Nutch so that it can work out of the box for SQL, HBase and Cassandra. We can even write commented-out entries in gora.properties, nutch-site.xml, etc. so that using Nutch with different backends becomes a configuration change. I will open a

Dynamic add slave to nutch cluster

2010-09-08 Thread yi zhu
I've been running a 2-datanode cluster to do a crawling job, and now I need to add one new node to the cluster without stopping it. I added a new line in conf/slaves; what should I do next? stop-all.sh and start-all.sh should work, but they seem to stop all running jobs in the cluster.

Re: Cygwin

2010-09-08 Thread Richard Huang
Can you share how you resolved it? Thanks. Sent from my iPhone On Sep 8, 2010, at 1:33 AM, Yavuz Selim YILMAZ wrote: > Ok Raj, I solved the problem, thnx. > -- > > Yavuz Selim YILMAZ > > > 2010/9/7 Nemani, Raj > >> Oh wait, Your command looks wrong too (dunno if that was a typo) >> >> I

Re: Mime type via index-more plugin

2010-09-08 Thread Julien Nioche
> Perhaps someone could give a pointer on how to read a configuration setting > for a plug-in and where to store the setting (Nutch config or plugin.xml) > and > i might actually write my first Java code again since four years! > You'd typically do that by adding something like * conf.getBoolean("
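
A minimal sketch of the pattern being described: a plug-in reads a boolean setting, with a default, when its configuration is set. To keep the example runnable without Hadoop on the classpath, `Conf` below is a stand-in written for this illustration that mimics only a `getBoolean(name, defaultValue)` lookup. The `Filter` class and the property name `example.mime.split.parts` are hypothetical and are not taken from Nutch.

```java
import java.util.Properties;

public class PluginConfigSketch {

    // Stand-in for the real configuration object (assumption for this sketch).
    static final class Conf {
        private final Properties props = new Properties();

        void set(String name, String value) {
            props.setProperty(name, value);
        }

        // Returns the default unless the value is exactly "true" or "false".
        boolean getBoolean(String name, boolean defaultValue) {
            String value = props.getProperty(name);
            if (value == null) {
                return defaultValue;
            }
            value = value.trim();
            if ("true".equals(value)) {
                return true;
            }
            if ("false".equals(value)) {
                return false;
            }
            return defaultValue;
        }
    }

    // Hypothetical plug-in class that caches the setting once in setConf.
    static final class Filter {
        private boolean splitMimeParts;

        void setConf(Conf conf) {
            // Hypothetical property name; default keeps the old behaviour.
            splitMimeParts = conf.getBoolean("example.mime.split.parts", true);
        }

        boolean splitsMimeParts() {
            return splitMimeParts;
        }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("example.mime.split.parts", "false");

        Filter filter = new Filter();
        filter.setConf(conf);
        System.out.println("split MIME parts: " + filter.splitsMimeParts());
    }
}
```

Reading the value once in `setConf` and caching it in a field avoids a lookup for every document processed; the default of `true` is chosen here so that an unset property leaves existing behaviour unchanged.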

Re: Nutch 2.0 Help

2010-09-08 Thread Julien Nioche
Hi guys, I've summarized the steps to follow for having GORA+Hbase with Nutch 2.0 on http://wiki.apache.org/nutch/GORA_HBase Feel free to amend and improve as you see fit. Please bear in mind that Nutch 2.0 is at a very early stage and is far from being bug-proof, see in particular [1]. HTH Ju

Re: Mime type via index-more plugin

2010-09-08 Thread Markus Jelsma
Julien, I've filed an issue [1], but I cannot, at this moment, provide a patch that enables configuration of this feature. I did disable it in my checkout though. Perhaps someone could give a pointer on how to read a configuration setting for a plug-in and where to store the setting (Nutch con

Re: ERROR tika.TikaParser org.apache.pdfbox.io.PushBackInputStream

2010-09-08 Thread Markus Jelsma
This description fooled me too once but it hasn't been patched yet? Now it is [1], please commit. [1]: https://issues.apache.org/jira/browse/NUTCH-900 On Wednesday 14 July 2010 07:10:47 Mattmann, Chris A (388J) wrote: > No problem, Brad! If you'd like feel free to create an issue in Nutch JIRA >

Re: Mime type via index-more plugin

2010-09-08 Thread Julien Nioche
Hi Markus, Your analysis is correct, see the comments in the MoreIndexingFilter: Add Content-Type and its primaryType and subType: add contentType, primaryType and subType to field "type" as un-stored, indexed and un-tokenized, so that search results can be confined by contentT

Re: Mime type via index-more plugin

2010-09-08 Thread Markus Jelsma
I've checked the MoreIndexingFilter sources and my suspicions were right: it really splits the input in the getParts method. I'd love to have this removed and committed, but I guess more work is needed to keep it compatible, such as tokenizing it to keep it searchable, which would require a sche

Mime type via index-more plugin

2010-09-08 Thread Markus Jelsma
Hi, I'm testing the index-more plug-in but, to my surprise, it is defined as a multi valued field in the shipped Solr schema configuration! Since when do files have more than one mime type? Well, they don't! It seems the plug-in splits mime types by slash and exports three terms per document,
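
The behaviour described in this thread (one MIME type turning into three indexed terms) can be illustrated with a short sketch. This is not the plug-in's actual source: the class `MimeParts` and the method `splitMimeType` are hypothetical names, and the code only assumes that the full type, the primary type and the subtype are each emitted as a separate value.

```java
import java.util.ArrayList;
import java.util.List;

public class MimeParts {

    // Returns the full MIME type followed by its primary type and subtype.
    // If there is no usable slash, only the original string is returned.
    static List<String> splitMimeType(String mimeType) {
        List<String> terms = new ArrayList<>();
        terms.add(mimeType);

        int slash = mimeType.indexOf('/');
        if (slash > 0 && slash < mimeType.length() - 1) {
            terms.add(mimeType.substring(0, slash));   // primary type, e.g. "text"
            terms.add(mimeType.substring(slash + 1));  // subtype, e.g. "html"
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [text/html, text, html]
        System.out.println(splitMimeType("text/html"));
    }
}
```

Because each document yields three values for the same field, the corresponding Solr field has to be declared multi-valued, which matches what the shipped schema does.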

Re: Solr and Nutch

2010-09-08 Thread Yavuz Selim YILMAZ
Also, for HTML, must the metadata be in the "head", or can it be in the "body"? -- Yavuz Selim YILMAZ 2010/9/8 Yavuz Selim YILMAZ > More than one field, then define a new plugin per new metadata? > > Different pages have different extra metadata, then would it be > configured in schema.xml and solrm