Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-24 Thread BlackIce
I have been able to compile under OpenJDK 11. Have not done anything further so far. I'm gonna try to get to it this evening. Greetz, Ralf On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma wrote: > > Hi, > > Everything seems fine, the crawler seems fine when trying the binary > distribution. The source

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-24 Thread BlackIce
so far... it doesn't select anything when creating segments: 0 records selected for fetching, exiting On Wed, Aug 24, 2022 at 3:02 PM BlackIce wrote: > > I have been able to compile under OpenJDK 11 > Have not done anything further so far > I'm gonna try to get to it t

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-24 Thread BlackIce
nevermind I made a typo... It fetches it parses On Thu, Aug 25, 2022 at 3:42 AM BlackIce wrote: > > so far... it doesn't select anything when creating segments: > 0 records selected for fetching, exiting > > On Wed, Aug 24, 2022 at 3:02 PM BlackIce wrote: > > >

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-29 Thread BlackIce
2:05 Sebastian Nagel wrote: > > > Hi Ralf, > > > > > It fetches it parses > > > > So a +1 ? > > > > Best, > > Sebastian > > > > On 8/25/22 05:22, BlackIce wrote: > > > nevermind I made a typo... > > > > >

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-30 Thread BlackIce
Tried some indexing... but when manually running "invertlinks" it says something about the input path not existing. Has invertlinks changed since 1.18? Greetz, RRK On Mon, Aug 29, 2022 at 3:38 PM BlackIce wrote: > > Haven't indexed anything to solr.. gonna give it a shot in a

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-30 Thread BlackIce
OK, I compiled Nutch under JDK 11. Did some basic fetching, parsing, link inversion and subsequent indexing to Solr 9. [+1] Great work! RRK On Tue, Aug 30, 2022 at 12:22 PM BlackIce wrote: > > Tried some indexing... but when manually doing "Invertilinks" it says > something about

Fwd: Optimizing Nutch 2.2.1

2014-03-18 Thread BlackIce
Hi, I'm using Nutch 2.2.1, HBase 0.90.6 in pseudo-distributed mode, Hadoop 1.2.1, Oracle Java 8, Intel i5 quad-core, 16 GB RAM. Currently the fetch cycle is limited by my Internet connection. The parse cycle uses an average of 10% per CPU core. The updatedb cycle uses an average of 3% per CPU core. Currently I'

Nutch 2.2.1 pseudo dist, errors

2014-03-18 Thread BlackIce
Hi, My first try at running Nutch in pseudo-distributed mode: when trying to run any nutch command from the /runtime/deploy folder I get the following error: hduser@bl4ck1c3:/usr/local/nutch2/runtime/deploy$ bin/nutch inject urls Warning: $HADOOP_HOME is deprecated. 14/03/18 16:19:33 INFO crawl.InjectorJob: Injector

Optimizing Nutch 2.2.1

2014-03-18 Thread BlackIce
Hi, I'm using Nutch 2.2.1, HBase 0.90.6 in pseudo-distributed mode, Hadoop 1.2.1, Oracle Java 8, Intel i5 quad-core, 16 GB RAM. Currently the fetch cycle is limited by my Internet connection. The parse cycle uses an average of 10% per CPU core. The updatedb cycle uses an average of 3% per CPU core. Currently I'

Re: Optimizing Nutch 2.2.1

2014-03-19 Thread BlackIce
count. > But optimization is very general concept. You should tune Nutch, Hdfs, > Jobtracker and Hbase settings. > > Good luck ;) > > > 2014-03-18 14:00 GMT+02:00 BlackIce : > > > Hi, > > > > I'm Using Nutch 2.2.1, Hbase 0.90.6 in pseudo distributed mode , H

solrdedup crashing in pseudo distributed mode (Nutch 2.2.1)

2014-03-19 Thread BlackIce
Hi, I managed to get Nutch 2.2.1 running in pseudo-distributed mode by making sure all libs are the same version across the Hadoop/HBase/Nutch ensemble. However, now when using the crawl script, the solrdedup job fails with: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.

Re: Book of Nutch

2014-03-19 Thread BlackIce
I skimmed this book as well. It saves a lot of time not having to Google all the info yourself. It also expands on some things, so it clarified many things for me. It is a very good starting point for a noob like me! I agree on the title, it's a getting-started book. On Wed, Mar 19, 2014 at

Re: Optimizing Nutch 2.2.1

2014-03-20 Thread BlackIce
Mar 2014 20:48, "BlackIce" wrote: > > > Thank you, > > > > what are some good starting points to start tuning? > > > > thnx > > > > > > On Tue, Mar 18, 2014 at 8:20 PM, Talat Uyarer wrote: > > > > > Hi, > >

Re: Nutch 2.2.1 pseudo dist, errors

2014-03-20 Thread BlackIce
On Thu, Mar 20, 2014 at 3:13 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi BlackIce, > > On Wed, Mar 19, 2014 at 3:07 PM, > wrote: > > > > > HI, > > > > My first try to run Nutch in pseudo dist, when trying to run any nutch >

Re: Nutch 2.2.1 pseudo dist, errors

2014-03-21 Thread BlackIce
plugins are located. Each > element may be a relative or absolute path. If absolute, it is used > as is. If relative, it is searched for on the classpath. > > > > 2014-03-20 13:53 GMT+02:00 BlackIce : > > > Thnx Lewis, Hadoop 1.2.1 > > >

Re: Nutch 2.2.1 pseudo dist, errors

2014-03-21 Thread BlackIce
(Configuration.java:810) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855) ... 8 more Next step: Downgrade to Java 6? (i'm on 8) On Fri, Mar 21, 2014 at 2:55 PM, BlackIce wrote: > you mean the one located > in /nutch/runtime/local ? > > > > On Thu, Mar 20,

Correct sintax for language-identifier plugin?

2014-03-21 Thread BlackIce
Hi, what is the correct syntax for the language-identifier plugin? I have this in my nutch-site.xml: plugin.includes protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|

Nutch 2.3 ?

2014-05-02 Thread BlackIce
Any idea on when Nutch 2.3 will be released? Thnx

Solr 4.7 Schema?

2014-05-02 Thread BlackIce
Does anyone have a good nutch/solr 4.7 schema file? Thnx

Nutch 1.8 Solrindexer failing

2014-05-03 Thread BlackIce
Hi, playing around with Nutch 1.8 in local mode on Solr 4.7. When indexing larger crawls, 10k and up, I get: Indexer: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357) at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114) at

Re: Nutch 1.8 Solrindexer failing

2014-05-03 Thread BlackIce
ack trace? Probably add more debug info > > in. > > > > This could be due to some disk size issue... > > > > > > On Sat, May 3, 2014 at 8:51 PM, BlackIce wrote: > > > >> HI, playing around with Nutch 1.8 in localmode on Solr 4.7.. > >>

Nutch 1.8 in pseudo dist error

2014-05-03 Thread BlackIce
Hi, what needs to be copied over to HDFS in Nutch 1.8, or what is the command? When trying to run the crawl script under /runtime/deploy I get the following: 14/05/03 14:59:03 INFO fetcher.Fetcher: Fetcher: starting at 2014-05-03 14:59:03 14/05/03 14:59:03 INFO fetcher.Fetcher: Fetcher: segm

Re: Nutch 1.8 in pseudo dist error

2014-05-03 Thread BlackIce
ments are named by a time-stamp, e.g. >.../TestCrawl/segments/20140502231126/ > "crawl_generate" is a subdir. > > Can you specify the exact commands to run the crawler? > > Sebastian > > On 05/03/2014 08:30 PM, BlackIce wrote: > > Hi, > > > > what needs

Nutch 1.8 CrawlDb update error

2014-05-04 Thread BlackIce
I get this error now when doing crawls at 120k each run: 2014-05-04 11:56:44,549 INFO crawl.CrawlDb - CrawlDb update: starting at 2014-05-04 11:56:44 2014-05-04 11:56:44,549 INFO crawl.CrawlDb - CrawlDb update: db: TestCrawl/crawldb 2014-05-04 11:56:44,549 INFO crawl.CrawlDb - CrawlDb update: s

Re: Solr 4.7 Schema?

2014-05-10 Thread BlackIce
Thnx On Wed, May 7, 2014 at 4:07 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi BlackIce, > > On Sat, May 3, 2014 at 10:52 PM, > wrote: > > > > > Does anyone have a good nutch/solr 4.7 schema file? > > > > > > What about t

Nutch 2.x from svn.

2014-05-11 Thread BlackIce
I just installed Nutch 2.x from SVN and Solrindexer is not working. My guess is that it has to do with Solrindexer now being a plug-in, so I activated it in the plug-ins (same as in 1.8). When trying to run the crawl script I get: Indexing TestCrawl12 on SOLR index -> http://localhost:8983/solr Ind

Re: Nutch 2.3 ?

2014-05-12 Thread BlackIce
The mailing list seems to have been a bit screwy. Reply to the other Nutch 2.x question: I have httpcore-4.2.5.jar; with what/where does it have to match? Thnx On Thu, May 8, 2014 at 1:40 AM, BlackIce wrote: > If someone could explain to me how to get the code from there > > >

Re: Nutch 2.x from svn.

2014-05-13 Thread BlackIce
httpcore-4.2.5, where would I look to make sure it's the right one? On Mon, May 12, 2014 at 5:48 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi BlackIce, > > > On Sun, May 11, 2014 at 9:20 AM, > wrote: > > > > > Subject: Nutch 2.x f

Re: Nutch 2.3 ?

2014-05-14 Thread BlackIce
I'm on it ;) On Wed, May 7, 2014 at 4:05 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi BlackIce, > > On Sat, May 3, 2014 at 10:52 PM, > wrote: > > > > > > > Any idea on when Nutch 2.3 will be released? > > >

Re: Nutch 1.8 Solrindexer failing

2014-05-14 Thread BlackIce
You are correct. I did some research and found it to be a Tika issue; it is fixed by setting the "Title" field to multivalued in schema.xml. I think the default Nutch schema should be updated accordingly! Thnx On Sat, May 3, 2014 at 8:27 PM, BlackIce wrote: > Bad Reques
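The schema change described in this message can be sketched roughly as follows. This is only an illustration in Python, not the actual Nutch schema: the two-field schema fragment, the field type `text_general`, and the helper name `make_multivalued` are assumptions made up for the example. Only the idea of marking the title field as multivalued comes from the message; `multiValued` is the attribute Solr uses for that in schema.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down schema fragment (assumption for illustration);
# the real Nutch schema.xml defines many more fields and types.
SCHEMA_SNIPPET = """<schema name="nutch" version="1.5">
  <fields>
    <field name="title" type="text_general" stored="true" indexed="true"/>
    <field name="url" type="string" stored="true" indexed="true"/>
  </fields>
</schema>"""


def make_multivalued(schema_xml, field_name):
    # Parse the schema, set multiValued="true" on the matching <field>,
    # and return the modified XML as a string.
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            field.set("multiValued", "true")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(make_multivalued(SCHEMA_SNIPPET, "title"))
```

In practice one would more likely edit schema.xml by hand and then reload the core; the script only shows which attribute changes.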

Re: Nutch 2.3 ?

2014-05-15 Thread BlackIce
If someone could explain to me how to get the code from there On Thu, May 8, 2014 at 1:39 AM, BlackIce wrote: > I'm on it ;) > > > On Wed, May 7, 2014 at 4:05 AM, Lewis John Mcgibbney < > lewis.mcgibb...@gmail.com> wrote: > >> Hi BlackIce, >> >

Re: Solr 4.7 Schema?

2014-05-16 Thread BlackIce
"Title" field needs to be set to multivalued - Tika issue, Tika may return multiple values for Title on PDFs On Thu, May 8, 2014 at 1:37 AM, BlackIce wrote: > Thnx > > > On Wed, May 7, 2014 at 4:07 AM, Lewis John Mcgibbney < > lewis.mcgibb...@gmail.com> w

Re: Solr 4.7 Schema?

2014-05-31 Thread BlackIce
sponse. I create a > issue for this. > > Talat > 17 May 2014 04:59, "BlackIce" wrote: > > > "Title" field needs to be set to multivalued - Tika issue, Tika may > return > > multiple values for Title on PDFs > > > > >

Solr Authenticication

2015-04-23 Thread BlackIce
Hi, I'm thinking of securing Solr a bit, and I'm finding that there are several ways of doing this. Does anyone have any experience with authentication for Solr in Nutch? Which type of Solr security does one use with Nutch 1.9? Thnx

Re: Solr Authenticication

2015-04-24 Thread BlackIce
olr server user > > > > > solr.auth.password > password > > Solr server password > > > > > > Done! > nutch use password to index in solr. > I hope this help yo. > > This post was very useful for me. > > http://communi

re-indexing Nutch data (Best Practice?)

2015-04-25 Thread BlackIce
Hi, We have our search engine now as Beta 0.1 at www.enlle.com We are using Nutch 1.9 to crawl the web and index data to Solr. Currently we are at over 4 million records, which will increase dramatically every day! It has occurred to me that we will be tweaking Solr frequently in order to improv

Solr as backend in Nutch 2.3? Which Hbase in 2.3

2015-05-14 Thread BlackIce
I was just going through the Nutch 2.3 Ivy config and saw that it can use Solr as a backend. Has anyone tried this? If so, is it better than HBase? Also, the Gora site says that Gora 0.5 in Nutch 2.3 can use: Apache Hadoop 1.0.1 and 2.4.0, Apache HBase 0.94.14. Has anyone tried this? Thnx

Re: Solr as backend in Nutch 2.3? Which Hbase in 2.3

2015-05-18 Thread BlackIce
can not write HBase > that run top of Hadoop 2.x. If you prefer use Hbase on Hadoop 2 You > should Gora 0.6 > > HTH > > 2015-05-15 3:47 GMT+03:00 BlackIce : > > I was just going trough the NUtch 2.3 IVY that it can use Solr as a > > backend, anyone have tried this? if s

Complaint from a crawled website!

2015-11-18 Thread BlackIce
Hi Group, I just received a complaint from my ISP stating that my "server" was attacking someone's firewall. My guess is that I had Nutch crawling too aggressively. My question is: what are "Best Practices" in order to avoid such problems? Return-path: Envelope-to: ab...@hetzner.de Delivery-date

Re: Complaint from a crawled website!

2015-11-18 Thread BlackIce
han once every 2+ seconds, but 5+ seconds is better. Also, do not select > over 500+ records for a host for each generation cycle. These guidelines > keep you safe almost all the time. Faster is possible though. > > M. > > -Original message- > From: BlackIce > Sent:
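To get a feel for what the guideline quoted above implies, here is a small back-of-the-envelope sketch in Python. It assumes one request at a time per host and ignores the time the downloads themselves take. The function name `min_cycle_seconds` is made up for the example and is not part of Nutch.

```python
def min_cycle_seconds(records_per_host, delay_seconds):
    # With a fixed pause between consecutive requests to the same host,
    # n requests need at least (n - 1) pauses. Download time is ignored,
    # so this is only a lower bound.
    if records_per_host <= 0:
        return 0.0
    return (records_per_host - 1) * delay_seconds


if __name__ == "__main__":
    # 500 records per host per cycle, with the 2 s and 5 s delays
    # mentioned in the quoted advice.
    for delay in (2.0, 5.0):
        seconds = min_cycle_seconds(500, delay)
        print(f"delay={delay}s: at least {seconds / 60:.1f} minutes per host")
```

Under these assumptions, 500 records at a 5-second delay keep a single host busy for a bit over 40 minutes per cycle, which is one way to see why larger per-host batches mostly make a cycle longer rather than faster.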

Re: Complaint from a crawled website!

2015-11-18 Thread BlackIce
> -Original message----- > From: BlackIce > Sent: Wednesday 18th November 2015 20:51 > To: user@nutch.apache.org > Subject: Complaint from a crawled website! > > Hi Group, > > I just received a complaint from my ISP stating that my "server" was > attacking som

Re: Complaint from a crawled website!

2015-11-18 Thread BlackIce
But, what has it to do with anything that MY machine is filtered via IPtables? On Wed, Nov 18, 2015 at 10:43 PM, BlackIce wrote: > My ISP has shutdown my site without prior notice > > On Wed, Nov 18, 2015 at 10:38 PM, Markus Jelsma < > markus.jel...@openindex.io> wrote: >

Nutch 1.11 - Index Metatags

2015-12-11 Thread BlackIce
Hi, Did I miss anything? I can't get the index metatags to work in 1.11 ... No error message, no data in solr 5.3.1 Any ideas? Thnx! plugin.includes language-identifier|protocol-http|urlfilter-regex|parse-(html|tika|metatag)|index-(basic|anchor|more|metadata)|indexer-solr|scoring-opic|urln

Re: Nutch 1.11 - Index Metatags

2015-12-13 Thread BlackIce
Amazing how a little typo can drive one nuts for days On Fri, Dec 11, 2015 at 10:14 PM, BlackIce wrote: > Hi, > > Did I miss anything? I can't get the index metatags to work in 1.11 ... > > No error message, no data in solr 5.3.1 > > Any ideas? Thnx! > >

Re: Anthelion from Yahoo

2015-12-17 Thread BlackIce
Interesting indeed, in more than one way... This is just a plug-in right? so it can be compiled with nutch 1.11? On Thu, Dec 17, 2015 at 10:25 AM, Markus Jelsma wrote: > Interesting! That triple extractor and wdc parser could be useful indeed! > It already uses any23. I wonder how easy we could

Robots.txt

2016-05-24 Thread BlackIce
Hi, I've just seen on a website which tracks bots, that "Tarantula" , our nutch 1.11 based crawler is being classified as not obeying robots.txt. What's the solution?

Re: Robots.txt

2016-05-24 Thread BlackIce
Science Group (IRDS) > Adjunct Associate Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > WWW: http://irds.usc.edu/ > ++++++ > > > > > > > > > >

Crawldb

2016-06-13 Thread BlackIce
I would like to "groom" the crawldb. My guess is that it should be easy to build upon the function that removes the 404 status and duplicates. But where do I find these? Thank you

Re: Problem integrating nutch 1.11 and solr 5.5.1 or 6.0.1

2016-06-13 Thread BlackIce
Solr 6 uses a different directory structure now. Follow the Solr tutorials on how to create a core; they will tell you where it creates the core's directory. Inside that directory should be a directory called /conf, that's where the schema goes. It's also a good idea to read as much as possible on solr, nu

Re: Problem integrating nutch 1.11 and solr 5.5.1 or 6.0.1

2016-06-13 Thread BlackIce
of thumb for Internet development: everything you need to know about the Internet can be found on the Internet. On 13/6/2016 15:53, "Jose-Marcio Martins da Cruz" < jose-marcio.mart...@mines-paristech.fr> wrote: > > Hi, > > Thanks Blackice. > > Can you suggest me

Re: Problem integrating nutch 1.11 and solr 5.5.1 or 6.0.1

2016-06-13 Thread BlackIce
Also, I've learned a lot from the writings of Erik Hatcher from Lucidworks. Probably the number 1 authority in the world on everything related to Solr. On 13/6/2016 15:53, "Jose-Marcio Martins da Cruz" < jose-marcio.mart...@mines-paristech.fr> wrote: > > Hi, > >

Re: Crawldb

2016-06-15 Thread BlackIce
enance, > otherwise 404s found by dead links are fetched again and again. > > Sebastian > > On 06/14/2016 10:23 PM, Lewis John Mcgibbney wrote: > > Hi BlackIce, > > > > On Mon, Jun 13, 2016 at 1:57 PM, > wrote: > > > >> From: BlackIce > >> T

Indexing to remote Solr server

2016-07-20 Thread BlackIce
Hi, Up till now we have been running Nutch and Solr on the same machine. But now we have a scenario where we want to run separate Nutch instances on separate machines and index to Solr over the Internet. Since the indexing to Solr will be done over the public Internet it presents us with

Re: Indexing to remote Solr server

2016-07-20 Thread BlackIce
might want to open a ticket for. Thanks Lewis On Wed, Jul 20, 2016 at 6:11 AM, wrote: > From: BlackIce > To: user@nutch.apache.org > Cc: > Date: Wed, 20 Jul 2016 15:11:22 +0200 > Subject: Indexing to remote Solr server > Hi, > > Up till now we have been running nutch and s

Re: indexing metatags with Nutch 1.12

2016-09-09 Thread BlackIce
I had a similar problem once.. it was some stupid syntax thing, lemme check my setup On Fri, Sep 9, 2016 at 2:46 PM, KRIS MUSSHORN wrote: > Looks like this is NOT in fact working. > > How do I get the metatags into Solr? > > i have a webpage @ https://snip/inside/directorates/cisd/asset.cfm

Re: indexing metatags with Nutch 1.12

2016-09-09 Thread BlackIce
oring-opic|urlnormalizer-( pass|regex|basic) index.parse.md metatag.description,metatag.keywords,h1,h2,h3,h4, h5,h6,metatag.title metatags.names description,keywords,title,h1,h2,h3,h4,h5,h6 On Fri, Sep 9, 2016 at 3:00 PM, BlackIce wrote: > I had a similar problem once.

RE: [Non-DoD Source] Re: indexing metatags with Nutch 1.12 (UNCLASSIFIED)

2016-09-09 Thread BlackIce
shorn@mail.mil > ~~ > > -Original Message- > From: BlackIce [mailto:blackice...@gmail.com] > Sent: Friday, September 09, 2016 9:31 AM > To: user@nutch.apache.org > Subject: [Non-DoD Source] Re: indexing metatags with Nutch 1.12 > > All

Re: nutch crawl everything

2016-09-09 Thread BlackIce
Change the -1 to a positive number like 5 or so (In the command) On Sep 9, 2016 8:20 PM, "KRIS MUSSHORN" wrote: > Executing this does NOT index everything in and under seed.txt. > > ./bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TEST_CORE > urls/ crawl -1 > > I have to run it m

Re: nutch crawl everything

2016-09-09 Thread BlackIce
it found in the first run, on the 3rd run it will fetch the links it found on the 2nd run and so forth... Have a great weekend everyone ! On Fri, Sep 9, 2016 at 9:05 PM, Comcast wrote: > Tried that. Same result > > Sent from my iPhone > > > On Sep 9, 2016, at 3:04 P
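The round-by-round behaviour described in this message can be sketched with a toy model. This is a simplification in Python, not Nutch's actual generate/fetch/update code; the link graph `LINKS` and the function `crawl_rounds` are hypothetical and exist only for illustration.

```python
# Hypothetical toy link graph (assumption): page -> pages it links to.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": [],
}


def crawl_rounds(seeds, links, rounds):
    # Each round fetches only the pages discovered in the previous round,
    # which is why a single round never reaches pages deeper than the seeds.
    seen = set(seeds)
    frontier = list(seeds)
    fetched_per_round = []
    for _ in range(rounds):
        if not frontier:
            break  # nothing new was discovered, so further rounds are no-ops
        fetched_per_round.append(frontier)
        next_frontier = []
        for page in frontier:
            for target in links.get(page, []):
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return fetched_per_round


if __name__ == "__main__":
    for number, pages in enumerate(crawl_rounds(["seed"], LINKS, 5), start=1):
        print(f"round {number}: {pages}")
```

In this toy graph the page "e" is only reached in the fourth round, so a crawl limited to one or two rounds would never index it.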

Open Graph metadata?

2016-09-18 Thread BlackIce
Can we now use Open graph metadata, if so how? Thnx Ralf

Re: control order of operations

2016-09-30 Thread BlackIce
Try these, don't remember which I used and don't have access to my setup right now (there used to be a whitelist/blacklist plugin, but I don't seem to be able to find it on Google right now) https://github.com/BayanGroup/nutch-custom-search On Sep 30, 2016 7:35 PM, "KRIS MUSSHORN" wrote: Ok bas

RE: control order of operations

2016-09-30 Thread BlackIce
Then make your own :) On Sep 30, 2016 11:13 PM, "Kris Musshorn" wrote: > Thanks blackice but I cant use a plug in that’s not been maintained in a > year in my production environment > > -Original Message- > From: BlackIce [mailto:blackice...@gmail.com] > Sent

Re: Problems with crawling images (pretty basic stuff)

2017-05-24 Thread BlackIce
Hi Filip, You mentioned that you commented out "External Links" - what do the links look like that point to the images? do they start with ":www.server.com" or something like "images.server.com"? With "External Links" turned off Nutch should interpret those links as "external sites" and thus not

Re: Problems with crawling images (pretty basic stuff)

2017-05-24 Thread BlackIce
d like to know is there a way to take control over the > search for the new links, especially if it's possible within the realm of > plugins. > > 2017-05-24 17:08 GMT+02:00 BlackIce : > > > Hi Filip, > > > > > > You mentioned that you commented out "Exter

Re: about installation of ambari and hadoop

2017-05-26 Thread BlackIce
Why would it be forbidden? Wasn't Cuba removed from the blocked Nations list under President Obama? On Fri, May 26, 2017 at 2:42 PM, Eyeris Rodriguez Rueda wrote: > Hi all. > I really want to install Ambari 2.5.0 and hadoop cluster in centos 7 but > when i try to access to the webpage it looks

Re: about installation of ambari and hadoop

2017-05-26 Thread BlackIce
treaties. The first step would be to contact Hortonworks compliance officer and see if indeed this item falls under such restrictions and then go from there. Hope this helps! Greetings! Ralf Kotowski www.enlle.com "La revolucion no sera televisada" On Fri, May 26, 2017 at 5:13 PM, Black

Re: [MASSMAIL]Re: about installation of ambari and hadoop

2017-05-26 Thread BlackIce
forbidden also. > > The problem is that i dont know how to continue. > maybe i will use a proxy to try to download the packages. > > I am very happy for your anwser and for your time to call to US Department. > really thanks. In Cuba this things could be difficult. >

Re: [MASSMAIL]Re: about installation of ambari and hadoop

2017-05-26 Thread BlackIce
I just got off the phone with someone at tech support over at Hortonworks who will forward my message to the corresponding person. I hope to hear back from them soon. On Fri, May 26, 2017 at 8:15 PM, BlackIce wrote: > do you have a list of the files in particular which are forbidden? > This

Re: nutch 1.x tutorial with solr 6.6.0

2017-07-09 Thread BlackIce
Sometimes it helps when one replaces the Solr.jar which comes with Nutch with the solr.jar that comes with the solr one is using On Sat, Jul 8, 2017 at 3:52 PM, Pau Paches wrote: > Hi, > I have run the Nutch 1.x Tutorial with Solr 6.6.0. > Many things do not work, there is a mismatch between the

Re: nutch 1.x tutorial with solr 6.6.0

2017-07-11 Thread BlackIce
I think by default the newer Solr starts in "schemaless" mode. One needs to create a config directory with ALL necessary configuration files like the schema and solrconfig.xml BEFORE creating the collection and then run a command to create this collection using this conf directory. I don't have access to m

Re: Nutch 1.13 release and Solr 6.6

2017-09-14 Thread BlackIce
Sure, that would be most excellent! On Sep 14, 2017 9:41 PM, "Hiran CHAUDHURI" wrote: > Hi there. > > > > When I tried to setup Nutch 1.13 to connect to Solr 6.6 I found out that > the Nutch schema shipped in .../conf/schema.xml needs quite some tweaking > before Solr can use it. > > The reason

Re: Unable to create core [nutch] Caused by: enablePositionIncrements is not a valid option as of Lucene 5.0

2017-09-28 Thread BlackIce
My guess would be that you need to look at schema.xml and remove the enablePositionIncrements option On Thu, Sep 28, 2017 at 6:44 PM, Sol Lederman wrote: > Hi, > > I'm following the tutorial to set up nutch with solr. I'm using a supported > pair: nutch 1.13 with solr 5.5.0. I get this error creating the nutch

Re: [VOTE] Release Apache Nutch 1.14 RC#1

2017-12-22 Thread BlackIce
17, 8:38 AM, "Sebastian Nagel" > wrote: > > > > Hi Folks, > > > > thanks to everyone who was able to review the release candidate! > > > > 72 hours have passed, please see below for vote results. > > > > [8] +1 Release this

Re: [ANNOUNCE] Apache Nutch 1.14 Release

2017-12-25 Thread BlackIce
Awesome On Mon, Dec 25, 2017 at 11:36 PM, Mattmann, Chris A (3010) < chris.a.mattm...@jpl.nasa.gov> wrote: > Great work Seb and team! > > Sent from my iPhone > > On Dec 25, 2017, at 1:29 PM, Jorge Betancourt < betancourt.jo...@gmail.com> wrote: > > Great news! > Thanks Sebastian! > > > Bes

Re: [ANNOUNCE] Apache Nutch 1.14 Release

2017-12-25 Thread BlackIce
Is it just me? The md5 checksums don't match On Tue, Dec 26, 2017 at 5:35 AM, BlackIce wrote: > Awesome > > On Mon, Dec 25, 2017 at 11:36 PM, Mattmann, Chris A (3010) < > chris.a.mattm...@jpl.nasa.gov> wrote: > >> Great work Seb and team! >> >> Sen

Re: [ANNOUNCE] Apache Nutch 1.14 Release

2017-12-25 Thread BlackIce
I'm also getting this on the source tarball and zip: gpg: BAD signature from "Sebastian Nagel " [unknown] On Tue, Dec 26, 2017 at 5:48 AM, BlackIce wrote: > Is it just me? The md5 checksums don't match > > > On Tue, Dec 26, 2017 at 5:35 AM, BlackIce wrote:

Re: [ANNOUNCE] Apache Nutch 1.14 Release

2017-12-25 Thread BlackIce
Nevermind.. my bad.. I was trying to get the files with wget through the link to the mirrors; obviously it would download only the HTML with the mirror list. Sorry On Tue, Dec 26, 2017 at 6:02 AM, BlackIce wrote: > I'm also getting this on the source tarball and zip: >
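The mix-up described in this message (a checksum mismatch because the download was actually an HTML mirror page) can be caught with a quick sanity check before comparing hashes. The following is a minimal Python sketch; the helper names `md5_hex` and `looks_like_html` are made up for the example, and the HTML test is a crude heuristic rather than a real content-type check.

```python
import hashlib


def md5_hex(data):
    # MD5 digest of the downloaded bytes, as a lowercase hex string.
    return hashlib.md5(data).hexdigest()


def looks_like_html(data):
    # Crude heuristic (assumption): an HTML page starts with a doctype or
    # an <html> tag, whereas a tarball or zip starts with binary magic bytes.
    head = data[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


if __name__ == "__main__":
    mirror_page = b"<!DOCTYPE html>\n<html><body>Mirror list</body></html>"
    if looks_like_html(mirror_page):
        print("Downloaded an HTML page, not the release archive.")
    print("md5:", md5_hex(mirror_page))
```

If the check flags the file as HTML, the checksum or signature mismatch says nothing about the release itself; the download simply has to be repeated from the direct file URL.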

Re: Search with Accent and without accent Character

2018-02-13 Thread BlackIce
Hi, As stated it's a Solr question... But I'll give you a hint (I don't have access to the server right now)... Stemming is different for Spanish than for English... If I remember correctly I had to use the hunspell tokenizer set for Spanish, or something similar to that.. Sorry I can't be more pre

Re: Search with Accent and without accent Character

2018-02-13 Thread BlackIce
Also, in order for Spanish accents to be properly stemmed... something had to be set to ISO Latin, and a proper file had to be supplied to Solr. I'm on a tablet and can't access the server to look. On Feb 13, 2018 10:03 PM, "BlackIce" wrote: Hi, As stated

removing "\n"... Nutch 1.14

2018-02-26 Thread BlackIce
Hi, I ran into a problem with Nutch 1.14 which I don't recall having in previous versions: I'm finding a lot of "\n" (newline?) in the content of crawled sites. I've tried different configurations/constellations of the HTML parser and Tika, and just Tika, to no avail. All the info I can find on thi

Re: removing "\n"... Nutch 1.14

2018-02-26 Thread BlackIce
ble. > > A simple > s/\n/ /g > should restore the old "look" of extracted plain texts. > > Best, > Sebastian > > > On 02/26/2018 04:17 PM, BlackIce wrote: > > Hi, > > > > did run into a problem with Nutch 1.14 which I don't recall havin
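The substitution suggested in the quoted reply, s/\n/ /g, can be written in Python roughly as below. This is just an illustrative equivalent; the function name `flatten_newlines` is made up, and where in the indexing pipeline one would apply it is left open.

```python
import re


def flatten_newlines(text):
    # Replace every newline character with a single space,
    # mirroring the sed-style expression from the quoted reply.
    return re.sub(r"\n", " ", text)


if __name__ == "__main__":
    sample = "First line\nsecond line\nthird line"
    print(flatten_newlines(sample))
```

Note that consecutive newlines become consecutive spaces with this pattern; collapsing them would need a different expression such as `\n+`.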

Re: Is there any way to block the hubpages while crawling

2018-03-18 Thread BlackIce
Basically what you're saying is that you need more control over what is being indexed? That's an excellent question! Greetz! On Mar 17, 2018 11:46 AM, "ShivaKarthik S" wrote: > Hi, > > Is there any way to block the hub pages & index only the articles from the > websites. I wanted to index only

Re: [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

2018-06-11 Thread BlackIce
+1 stoopid question, but I can't find any info on it... can we now parse Open Graph metatags? Greetz On Mon, Jun 11, 2018 at 9:11 PM Roannel Fernández Hernández wrote: > +1 > > Regards > > - Chris Mattmann escribió: > > ++1! > > > > > > > > Sounds great. > > > > > > > > Cheers, > > > > Ch

Re: Opengraph metadata was [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

2018-06-12 Thread BlackIce
og:description :The Open Graph protocol enables any web page to > become a rich object in a > social graph. > > > On 06/11/2018 11:44 PM, BlackIce wrote: > > +1 > > > > stoopid question, but I can't find any info on it... can we now parse > Open >

Re: Opengraph metadata was [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

2018-06-12 Thread BlackIce
PS: Does this work when configured in site.xml like regular metadata? On Tue, Jun 12, 2018 at 1:31 PM BlackIce wrote: > sweet thnx! > > On Tue, Jun 12, 2018 at 1:29 PM Sebastian Nagel < > wastl.na...@googlemail.com> wrote: > >> > stoopid question, but I can'

Re: Opengraph metadata was [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

2018-06-12 Thread BlackIce
overwrites definition in nutch-default.xml > > On 06/12/2018 02:26 PM, BlackIce wrote: > > PS: Does this work when configured in site.xml like regular metatdata? > > > > On Tue, Jun 12, 2018 at 1:31 PM BlackIce wrote: > > > >> sweet thnx! > >>

Re: [RESULT] was [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-07 Thread BlackIce
Splendid On Tue, Aug 7, 2018 at 3:46 PM lewis john mcgibbney wrote: > Excellent. Thanks for taking on release manager Seb, it’s making a huge > impact. Nice work folks. > > On Tue, Aug 7, 2018 at 05:37 wrote: > > > > > user Digest 7 Aug 2018 12:37:25 - Issue 2921 > > > > Topics (messages 34

Re: rejected by filters

2018-08-08 Thread BlackIce
I think you are correct in your assumption. According to this: https://issues.apache.org/jira/browse/NUTCH-2620?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel Nutch assumes that the TLD is no longer than 4 characters; this is in the process of being fixed in the next rel

Re: metatag.description while index data

2018-08-30 Thread BlackIce
try making these fields "Multivalued", like so: ... On Thu, Aug 30, 2018 at 1:45 PM Amarnatha Reddy wrote: > Hi Nutch Team, > > We are trying to crwal a websites which is korea and japanees langaugae > based, while doing to index data into solr we are getting into below error, > kindly sugg

Re: metatag.description while index data

2018-08-30 Thread BlackIce
Sorry if this seems trivial, but did you reload the collection and/or restart Solr? On Thu, Aug 30, 2018 at 4:19 PM Amarnatha Reddy wrote: > Still am facing the same issue after changing the suggested values any clue > please > > Amarnath > > On Thu 30 Aug, 2018, 7:50 P

Re: Block certain parts of HTML code from being indexed

2018-11-16 Thread BlackIce
There was a plugin a while ago which allowed you to specify different tags to be indexed or excluded from being indexed. If I'm not mistaken it was this: http://www.longconnections.com/blog/2015/6/3/using-apache-nutchsolr-to-build-a-search-engine-with-auto-complete-feature Good luck and please let