Re: DIH with huge data
That sounds like a good option. So the Spark job will connect to MySQL and build Solr documents, which are then pushed into Solr using SolrJ, probably in batches.
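For what it's worth, the batching loop could look something like the sketch below. Plain maps stand in for `SolrInputDocument` so it runs without the SolrJ jar on the classpath; the commented lines mark where the real `solrClient.add(...)` / `commit()` calls would go, and the batch size of 500 is just an assumption to tune.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchIndexer {
    static final int BATCH_SIZE = 500; // tuning assumption, not from this thread

    // Accumulates documents and flushes every BATCH_SIZE docs, plus one final
    // partial flush. Returns the number of flushes, i.e. how many
    // solrClient.add(batch) calls the real SolrJ version would make.
    static int indexInBatches(Iterable<Map<String, Object>> docs) {
        List<Map<String, Object>> batch = new ArrayList<>();
        int flushes = 0;
        for (Map<String, Object> doc : docs) {
            batch.add(doc);
            if (batch.size() >= BATCH_SIZE) {
                // solrClient.add(toSolrInputDocs(batch));  // real SolrJ call here
                batch.clear();
                flushes++;
            }
        }
        if (!batch.isEmpty()) {
            // solrClient.add(toSolrInputDocs(batch));
            flushes++;
        }
        // solrClient.commit();  // one commit at the end, not per batch
        return flushes;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (int i = 0; i < 1234; i++) {
            Map<String, Object> d = new HashMap<>();
            d.put("id", i);
            docs.add(d);
        }
        System.out.println(indexInBatches(docs)); // 1234 docs -> 2 full + 1 partial flush
    }
}
```

Committing once at the end (or relying on autoCommit) rather than per batch is what keeps this fast.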
--
Thanks,
Sujay P Bawaskar
M: +91-77091 53669
Re: DIH with huge data
CSV -> Spark -> Solr:
https://github.com/lucidworks/spark-solr/blob/master/docs/examples/csv.adoc

If speed is not an issue, there are other methods; Spring Batch / Spring Data might have all the tools you need without Spark.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
Re: DIH with huge data
If you want speed, Spark is the fastest and easiest way. You can connect to relational tables directly and import, or export to CSV / JSON and import from a distributed filesystem like S3 or HDFS.

Combining a DFS with Spark and a highly available Solr, you are maximizing all threads.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
Re: DIH with huge data
Thanks Rahul. The data source is a JdbcDataSource with a MySQL database; the data size is around 100 GB.

I am not very familiar with Spark, but are you suggesting that we should create each document by merging distinct RDBMS tables using RDDs?

--
Thanks,
Sujay P Bawaskar
M: +91-77091 53669
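The table-merge idea being discussed here (join the RDBMS tables on a key, then map each joined group to one flat document) can be sketched with plain collections; in Spark the same shape would be a join followed by a map. The table and field names (`id`, `parent_id`, `sku`, `skus`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DocMerger {
    // Merge parent rows with their child rows into one flat "document" per
    // parent, keyed on hypothetical columns "id" / "parent_id". Child values
    // land in a multivalued "skus" field.
    static List<Map<String, Object>> merge(List<Map<String, Object>> parents,
                                           List<Map<String, Object>> children) {
        // Group child rows by foreign key (what a join on the key produces)
        Map<Object, List<Map<String, Object>>> byParent = children.stream()
                .collect(Collectors.groupingBy(c -> c.get("parent_id")));
        List<Map<String, Object>> docs = new ArrayList<>();
        for (Map<String, Object> p : parents) {
            Map<String, Object> doc = new HashMap<>(p);
            List<Object> skus = new ArrayList<>();
            for (Map<String, Object> c : byParent.getOrDefault(p.get("id"), List.of())) {
                skus.add(c.get("sku"));
            }
            doc.put("skus", skus); // would be a multivalued field in the Solr schema
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> parents = List.of(Map.of("id", 1, "name", "order-1"));
        List<Map<String, Object>> children = List.of(
                Map.of("parent_id", 1, "sku", "A"),
                Map.of("parent_id", 1, "sku", "B"));
        System.out.println(merge(parents, children));
    }
}
```

At 100 GB this grouping would of course not fit in one JVM's memory, which is exactly why a distributed join (Spark) or a sorted streaming merge (zipper) is attractive.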
Re: DIH with huge data
How much data, and what is the database source? Spark is probably the fastest way.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
DIH with huge data
Hi,

We are using DIH with SortedMapBackedCache, but as the data size increases we need to give the Solr JVM more and more heap memory.

Can we use multiple CSV files instead of database queries, and join the data in those CSV files later using zipper? So the bottom line is to create a CSV file for each entity in data-config.xml and join these CSV files with a zipper join.

We also tried the EHCache-based DIH cache, but since EHCache uses memory-mapped I/O it is not a good fit alongside MMapDirectoryFactory, and it causes physical memory on the machine to be exhausted.

Please suggest how we can handle the use case of importing a huge amount of data into Solr.

--
Thanks,
Sujay P Bawaskar
M: +91-77091 53669
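For reference, a data-config.xml sketch of a zipper join in its documented SQL form (connection details, table, and column names here are made up). Zipper streams two result sets that are both sorted on the join key, so nothing is held in a heap cache; the same ordering requirement would apply if each entity were backed by a pre-sorted CSV file instead.

```xml
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
  <document>
    <!-- Both result sets MUST be ordered by the join key for zipper to work -->
    <entity name="order" query="SELECT id, customer FROM orders ORDER BY id">
      <entity name="item" join="zipper"
              query="SELECT order_id, sku FROM order_items ORDER BY order_id"
              where="order_id=order.id"/>
    </entity>
  </document>
</dataConfig>
```

Because both streams are consumed in lockstep, memory use stays flat regardless of data size, unlike SortedMapBackedCache, which materializes the child result set in heap.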