Re: Indexing from nutch 1.6 to solr 4.3.1 cloud

Tuğcem Oral Tue, 09 Jul 2013 05:35:39 -0700

I'll try indexing ~5M crawled documents into an 8-node cluster with this
patch and let you know the result.


Best.


On Tue, Jul 9, 2013 at 1:55 PM, Markus Jelsma <[email protected]> wrote:

> Hi,
>
> Just as I explained: the DistributedUpdateRequestProcessor does that on
> the Solr node for you. There's a Solr issue for client-based document
> routing, which we will use once it is committed and released. Then
> indexing will be as efficient as it can be. See
> https://issues.apache.org/jira/browse/SOLR-4816
>
> The problem with CommonsHttpSolrServer is that it does not fail over the
> way CloudSolrServer does, which uses LBHttpSolrServer underneath.
> CommonsHttpSolrServer also no longer exists in Solr 4.x, so it will stop
> working once NUTCH-1486 is committed. Keep an eye on SOLR-4816.
> Hopefully it will make Solr 4.4, which is probably going to be released
> in August.
>
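The fail-over behaviour mentioned above can be illustrated with a minimal, self-contained Java sketch. It does not use SolrJ: the class `FailoverSketch`, the method `sendWithFailover`, and the node URLs are all hypothetical, and the `send` predicate merely stands in for an HTTP update request. The idea is only that a client holding several node URLs can move on to the next node when one is unreachable, whereas a client bound to a single URL cannot.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical illustration of client-side fail-over; not SolrJ code. */
final class FailoverSketch {

    /**
     * Tries each node URL in order and returns the first one that accepts
     * the request. The predicate stands in for sending an update over HTTP
     * and returns false when the node cannot be reached.
     */
    static String sendWithFailover(List<String> nodeUrls, Predicate<String> send) {
        for (String url : nodeUrls) {
            if (send.test(url)) {
                return url; // this node accepted the request
            }
            // otherwise fall through and try the next node
        }
        throw new IllegalStateException("no live node accepted the request");
    }

    public static void main(String[] args) {
        // Assumed example URLs; replace with real node addresses.
        List<String> nodes = Arrays.asList(
                "http://solr1:8983/solr",
                "http://solr2:8983/solr",
                "http://solr3:8983/solr");

        // Pretend the first node is down.
        String used = sendWithFailover(nodes, url -> !url.contains("solr1"));
        System.out.println("request served by " + used);
    }
}
```

With a single fixed URL the loop has only one candidate, so an unreachable node means a failed request.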
> Cheers
>
>
> -----Original message-----
> > From:Tuğcem Oral <[email protected]>
> > Sent: Tuesday 9th July 2013 12:51
> > To: [email protected]
> > Subject: Re: Indexing from nutch 1.6 to solr 4.3.1 cloud
> >
> > Every point is OK except one: if there's no partitioning in SolrJ, how
> > would, say, 1000 documents be distributed across the nodes? One by one?
> > What will the strategy be?
> >
> > No need to open a new issue; my patch does a similar job without
> > CloudSolrServer, using CommonsHttpSolrServer(s) instead. I'll give your
> > patch a shot.
> >
> > Best
> >
> >
> > On Tue, Jul 9, 2013 at 1:34 PM, Markus Jelsma <[email protected]> wrote:
> >
> > > Yes, it only takes the URLs of your Zookeeper ensemble, because that is
> > > how CloudSolrServer works, and it is the best method of connecting to a
> > > SolrCloud cluster from Java. As said, there is no partitioning at all
> > > (SolrJ document routing is not yet committed), but your Solr nodes'
> > > DistributedUpdateRequestProcessor redistributes incoming documents.
> > > Documents are also not sent over Zookeeper: CloudSolrServer only uses
> > > the Zookeeper ensemble to find all nodes of the cluster and to
> > > distinguish leaders from replicas, so documents are sent to leaders
> > > only.
> > >
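The server-side redistribution described above can be pictured with a simplified, self-contained Java sketch. This is not Solr's implementation: Solr hashes the document id and maps it onto per-shard hash ranges, whereas this sketch uses `String.hashCode()` modulo the shard count purely to show the idea. The class `ShardRoutingSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified illustration of hash-based document routing. */
final class ShardRoutingSketch {

    /** Picks a shard index for a document id (assumption: hashCode modulo shard count). */
    static int shardFor(String docId, int numShards) {
        // floorMod keeps the result in [0, numShards) even for negative hashes.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /** Groups a batch of ids by target shard, as a receiving node would before forwarding. */
    static Map<Integer, List<String>> route(List<String> docIds, int numShards) {
        Map<Integer, List<String>> byShard = new LinkedHashMap<>();
        for (int shard = 0; shard < numShards; shard++) {
            byShard.put(shard, new ArrayList<>());
        }
        for (String id : docIds) {
            byShard.get(shardFor(id, numShards)).add(id);
        }
        return byShard;
    }

    public static void main(String[] args) {
        // 1000 made-up document ids sent to any single node of an 8-shard cluster.
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            batch.add("http://example.com/page/" + i);
        }
        Map<Integer, List<String>> routed = route(batch, 8);
        for (Map.Entry<Integer, List<String>> e : routed.entrySet()) {
            System.out.println("shard " + e.getKey() + ": " + e.getValue().size() + " docs");
        }
    }
}
```

Because the shard is derived from the id alone, the client does not need to know the layout: whichever node receives a document can work out where it belongs and forward it.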
> > > Depending on what exactly your patch does, you may need to open a new
> > > issue. If it's also about writing data to a SolrCloud cluster,
> > > NUTCH-1377 via Zookeeper is the only proper way to go.
> > >
> > > Cheers
> > >
> > > -----Original message-----
> > > > From:Tuğcem Oral <[email protected]>
> > > > Sent: Tuesday 9th July 2013 12:29
> > > > To: [email protected]
> > > > Subject: Re: Indexing from nutch 1.6 to solr 4.3.1 cloud
> > > >
> > > > Markus,
> > > >
> > > > I checked yours. They're quite similar, but yours only takes Zookeeper
> > > > ensemble URLs, while mine looks for all Solr URLs of a cluster. How do
> > > > you partition the documents? Is sending them over Zookeeper enough?
> > > >
> > > > BTW, my patch is ready. How am I supposed to attach it?
> > > >
> > > > Best
> > > >
> > > >
> > > > On Tue, Jul 9, 2013 at 1:11 PM, Markus Jelsma <[email protected]> wrote:
> > > >
> > > > > I attached a patch that adds support for CloudSolrServer and a
> > > > > Zookeeper ensemble. Use solr.zookeeper.hosts and solr.collection to
> > > > > enable it. The patch also requires NUTCH-1486.
> > > > > https://issues.apache.org/jira/browse/NUTCH-1377
> > > > >
> > > > >
> > > > >
> > > > > -----Original message-----
> > > > > > From:Tuğcem Oral <[email protected]>
> > > > > > Sent: Tuesday 9th July 2013 9:31
> > > > > > To: [email protected]
> > > > > > Subject: Re: Indexing from nutch 1.6 to solr 4.3.1 cloud
> > > > > >
> > > > > > So I suppose your org.apache.nutch.indexer.solr.SolrIndexer utility
> > > > > > is not running from Nutch 1.6; it might be the one from Nutch 2.1.
> > > > > > In 1.6 you cannot do such a thing, because multiple Solr instances
> > > > > > (i.e. SolrCloud) and partitioning are not supported in that version.
> > > > > >
> > > > > >
> > > > > > On Tue, Jul 9, 2013 at 12:55 AM, <[email protected]> wrote:
> > > > > >
> > > > > > > I give only one URL to the solrindex command, and SolrCloud takes
> > > > > > > care of partitioning. I do not use SolrJ and actually did not
> > > > > > > understand Markus's comments. I use Solr 4.2.0 with the cloud
> > > > > > > feature.
> > > > > > >
> > > > > > > Thanks.
> > > > > > > Alex.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Tuğcem Oral <[email protected]>
> > > > > > > To: user <[email protected]>
> > > > > > > Sent: Mon, Jul 8, 2013 1:26 pm
> > > > > > > Subject: Indexing from nutch 1.6 to solr 4.3.1 cloud
> > > > > > >
> > > > > > >
> > > > > > > @alex, I don't understand how you could give multiple Solr URLs
> > > > > > > while indexing from 1.6, because solrindex handles the given Solr
> > > > > > > URL with a single SolrServer instance rather than a
> > > > > > > List<SolrServer>, and, as @Markus said, SolrJ doesn't support
> > > > > > > partitioning. The phrase you used, "indexing using with nutch 1.6
> > > > > > > and 2.1", is a bit confusing to me; which versions of SolrJ and
> > > > > > > Solr (cloud) you are using matters, I suppose.
> > > > > > >
> > > > > > > @erol, I can upload the patch tomorrow and notify you about it.
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Tugcem
> > > > > > >
> > > > > > > On Monday, July 8, 2013, eakarsu wrote:
> > > > > > >
> > > > > > > > Tugcem,
> > > > > > > >
> > > > > > > > Can you please send me the patch as well?
> > > > > > > > I would like to test it.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > Erol Akarsu



-- 
TO
