Raymond,

Running parallel indexing might be trickier than it looks if the scale is big. For instance, you can easily partition your data (say, into 5 chunks) and run 5 processes to index them in parallel. However, you need to watch for bottlenecks along the pipeline (e.g. database I/O, or even commits at the core). If you think your infrastructure can handle the load, you can try the approach I described above.
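A minimal sketch of the partition-and-parallelize idea, using Python's multiprocessing. The file paths, chunk count, and the stand-in index_chunk body are all illustrative; in a real run the worker would send documents to Solr's update handler (e.g. via pysolr or plain HTTP) rather than just count them:

```python
# Sketch: partition the input into N chunks, index each chunk in its own process.
from multiprocessing import Pool

def partition(items, n):
    """Split items into n roughly equal chunks (round-robin)."""
    return [items[i::n] for i in range(n)]

def index_chunk(chunk):
    # Real worker: parse each file and POST it to Solr's update handler,
    # committing once per batch at the end -- frequent commits are a
    # common choke point, as noted above.
    return len(chunk)  # stand-in: report how many docs were "indexed"

if __name__ == "__main__":
    files = [f"/data/file_{i}.json" for i in range(1000)]  # hypothetical paths
    chunks = partition(files, 5)
    with Pool(5) as pool:
        counts = pool.map(index_chunk, chunks)
    print(sum(counts))  # 1000 docs across 5 workers
```

Note the single commit per chunk rather than per document; that is usually the first bottleneck to remove.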
On Thu, May 24, 2018 at 9:36 AM, Raymond Xie <xie3208...@gmail.com> wrote:

> Thank you Rahul, although that is very high level.
>
> With no offense, do you have a successful implementation, or is it just your
> unproven idea? I have never used Rabbit nor Kafka before, but I would be very
> interested in more detail on the Kafka idea, as Kafka is available
> in my environment.
>
> Thank you again, and I look forward to hearing more from you or anyone in this
> Solr community.
>
> *------------------------------------------------*
> *Sincerely yours,*
>
> *Raymond*
>
> On Wed, May 23, 2018 at 8:15 AM, Rahul Singh <rahul.xavier.si...@gmail.com>
> wrote:
>
> > Enumerate the file locations (map), put them in a queue like Rabbit or
> > Kafka (persist the map), and have a bunch of threads, workers, containers,
> > whatever, pop off the queue and process the item (reduce).
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On May 20, 2018, 7:24 AM -0400, Raymond Xie <xie3208...@gmail.com>, wrote:
> >
> > I know how to do indexing on a file system, like a single file or folder,
> > but how do I do that in a parallel way? The data I need to index is of
> > huge volume and can't be put on HDFS.
> >
> > Thank you
> >
> > *------------------------------------------------*
> > *Sincerely yours,*
> >
> > *Raymond*

-- 
Best regards,
Adhyan Arizki
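The enumerate-queue-workers pattern Rahul describes in the quoted thread can be sketched in-process. Kafka or RabbitMQ is swapped for Python's queue.Queue purely for illustration; the worker body is a stand-in for real parsing and indexing against Solr:

```python
# Sketch: enumerate file locations (map), push them onto a queue,
# and let a pool of workers pop and process them (reduce).
# A real deployment would use Kafka or RabbitMQ; queue.Queue stands in here.
import queue
import threading

def worker(q, results):
    while True:
        path = q.get()
        if path is None:        # sentinel: no more work for this worker
            q.task_done()
            return
        # Real worker: parse the file and send it to Solr, committing in batches.
        results.append(path)    # stand-in for "indexed this document"
        q.task_done()

def run(paths, n_workers=4):
    q = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for p in paths:             # map step: enumerate the file locations
        q.put(p)
    for _ in threads:           # one sentinel per worker to shut them down
        q.put(None)
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    indexed = run([f"/data/file_{i}" for i in range(100)])  # hypothetical paths
    print(len(indexed))  # 100
```

With a durable broker in place of queue.Queue, the enqueued map also persists, so workers can crash and resume without re-enumerating the files.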