Re: DIH with huge data

2018-04-12 Thread Sujay Bawaskar
That sounds like a good option. So the Spark job will connect to MySQL and
create Solr documents, which are pushed into Solr using SolrJ, probably in
batches.
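
A rough sketch of that flow, just to make sure I understand it (host, table,
collection and field names below are placeholders, not something agreed in this
thread):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument
import scala.collection.JavaConverters._

object MySqlToSolr {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("mysql-to-solr").getOrCreate()

    // Read the source table through Spark's JDBC data source; the partitioning
    // options control how many parallel queries hit MySQL.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/appdb")
      .option("dbtable", "documents")
      .option("user", "solr_etl")
      .option("password", "secret")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "8")
      .load()

    // Each partition opens its own SolrJ client and indexes its rows in batches.
    df.rdd.foreachPartition { rows: Iterator[Row] =>
      val client = new HttpSolrClient.Builder("http://solrhost:8983/solr/mycollection").build()
      try {
        rows.grouped(1000).foreach { batch =>
          val docs = batch.map { row =>
            val doc = new SolrInputDocument()
            doc.addField("id", row.getAs[Long]("id"))
            doc.addField("title", row.getAs[String]("title"))
            doc
          }
          client.add(docs.asJava)
        }
        client.commit()
      } finally {
        client.close()
      }
    }
    spark.stop()
  }
}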

On Thu, Apr 12, 2018 at 10:48 PM, Rahul Singh 
wrote:

> If you want speed, Spark is the fastest and easiest way. You can connect to
> relational tables directly and import, or export to CSV / JSON and import
> from a distributed filesystem like S3 or HDFS.
>
> Combining a DFS with Spark and a highly available Solr, you are
> maximizing all threads.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar ,
> wrote:
> > Thanks Rahul. The data source is JdbcDataSource with a MySQL database, and
> > the data size is around 100GB.
> > I am not very familiar with Spark, but are you suggesting that we should
> > create documents by merging distinct RDBMS tables using RDDs?
> >
> > On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh <
> rahul.xavier.si...@gmail.com
> > wrote:
> >
> > > How much data and what is the database source? Spark is probably the
> > > fastest way.
> > >
> > > --
> > > Rahul Singh
> > > rahul.si...@anant.us
> > >
> > > Anant Corporation
> > >
> > > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar <
> sujaybawas...@gmail.com>,
> > > wrote:
> > > > Hi,
> > > >
> > > > We are using DIH with SortedMapBackedCache, but as the data size
> > > > increases we need to provide more and more heap memory to the Solr JVM.
> > > > Can we use multiple CSV files instead of database queries, and later
> > > > join the data in the CSV files using zipper? The bottom line is to
> > > > create a CSV file for each entity in data-config.xml and join these
> > > > CSV files using zipper.
> > > > We also tried the EHCache-based DIH cache, but since EHCache uses
> > > > MMap IO it does not work well with MMapDirectoryFactory and ends up
> > > > exhausting the physical memory on the machine.
> > > > Please suggest how we can handle the use case of importing a huge
> > > > amount of data into Solr.
> > > >
> > > > --
> > > > Thanks,
> > > > Sujay P Bawaskar
> > > > M:+91-77091 53669
> > >
> >
> >
> >
> > --
> > Thanks,
> > Sujay P Bawaskar
> > M:+91-77091 53669
>



-- 
Thanks,
Sujay P Bawaskar
M:+91-77091 53669


Re: DIH with huge data

2018-04-12 Thread Rahul Singh

CSV -> Spark -> Solr

https://github.com/lucidworks/spark-solr/blob/master/docs/examples/csv.adoc
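
Boiled down, that example amounts to something like the sketch below; the
ZooKeeper string, collection name and file path are placeholders, and the
"zkhost" / "collection" / "soft_commit_secs" options follow the spark-solr
documentation linked above:

import org.apache.spark.sql.SparkSession

object CsvToSolr {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("csv-to-solr").getOrCreate()

    // Read the exported CSV (local, HDFS or S3), taking the schema from the header row.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///exports/documents/*.csv")

    // Write to SolrCloud through the spark-solr data source.
    df.write
      .format("solr")
      .option("zkhost", "zkhost1:2181,zkhost2:2181/solr")
      .option("collection", "mycollection")
      .option("soft_commit_secs", "10")
      .mode("overwrite")
      .save()

    spark.stop()
  }
}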

If speed is not an issue, there are other methods. Spring Batch / Spring Data
might have all the tools you need to get enough speed without Spark.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar , wrote:
> Thanks Rahul. The data source is JdbcDataSource with a MySQL database, and the
> data size is around 100GB.
> I am not very familiar with Spark, but are you suggesting that we should
> create documents by merging distinct RDBMS tables using RDDs?
>
> On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh  wrote:
>
> > How much data and what is the database source? Spark is probably the
> > fastest way.
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar ,
> > wrote:
> > > Hi,
> > >
> > > We are using DIH with SortedMapBackedCache, but as the data size increases
> > > we need to provide more and more heap memory to the Solr JVM.
> > > Can we use multiple CSV files instead of database queries, and later join
> > > the data in the CSV files using zipper? The bottom line is to create a CSV
> > > file for each entity in data-config.xml and join these CSV files using
> > > zipper.
> > > We also tried the EHCache-based DIH cache, but since EHCache uses MMap IO
> > > it does not work well with MMapDirectoryFactory and ends up exhausting the
> > > physical memory on the machine.
> > > Please suggest how we can handle the use case of importing a huge amount
> > > of data into Solr.
> > >
> > > --
> > > Thanks,
> > > Sujay P Bawaskar
> > > M:+91-77091 53669
> >
>
>
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669


Re: DIH with huge data

2018-04-12 Thread Rahul Singh
If you want speed, Spark is the fastest and easiest way. You can connect to
relational tables directly and import, or export to CSV / JSON and import from
a distributed filesystem like S3 or HDFS.

Combining a DFS with Spark and a highly available Solr, you are maximizing all
threads.
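
For the export variant, a minimal sketch (hosts, credentials, table names and
paths are placeholders) might look like this; a separate indexing job can then
pick the staged files up from the distributed filesystem:

import org.apache.spark.sql.SparkSession

object ExportTablesToDfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("export-to-dfs").getOrCreate()

    val tables = Seq("documents", "authors", "tags")

    // Dump each relational table to JSON on HDFS (or S3 via an s3a:// path);
    // the indexing job (SolrJ or spark-solr) then reads these files in parallel.
    tables.foreach { table =>
      spark.read.format("jdbc")
        .option("url", "jdbc:mysql://dbhost:3306/appdb")
        .option("dbtable", table)
        .option("user", "solr_etl")
        .option("password", "secret")
        .load()
        .write
        .mode("overwrite")
        .json(s"hdfs:///staging/$table")
    }

    spark.stop()
  }
}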

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar , wrote:
> Thanks Rahul. The data source is JdbcDataSource with a MySQL database, and the
> data size is around 100GB.
> I am not very familiar with Spark, but are you suggesting that we should
> create documents by merging distinct RDBMS tables using RDDs?
>
> On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh  wrote:
>
> > How much data and what is the database source? Spark is probably the
> > fastest way.
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar ,
> > wrote:
> > > Hi,
> > >
> > > We are using DIH with SortedMapBackedCache, but as the data size increases
> > > we need to provide more and more heap memory to the Solr JVM.
> > > Can we use multiple CSV files instead of database queries, and later join
> > > the data in the CSV files using zipper? The bottom line is to create a CSV
> > > file for each entity in data-config.xml and join these CSV files using
> > > zipper.
> > > We also tried the EHCache-based DIH cache, but since EHCache uses MMap IO
> > > it does not work well with MMapDirectoryFactory and ends up exhausting the
> > > physical memory on the machine.
> > > Please suggest how we can handle the use case of importing a huge amount
> > > of data into Solr.
> > >
> > > --
> > > Thanks,
> > > Sujay P Bawaskar
> > > M:+91-77091 53669
> >
>
>
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669


Re: DIH with huge data

2018-04-12 Thread Sujay Bawaskar
Thanks Rahul. The data source is JdbcDataSource with a MySQL database, and the
data size is around 100GB.
I am not very familiar with Spark, but are you suggesting that we should
create documents by merging distinct RDBMS tables using RDDs?
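
Something like the following is what I have in mind: a sketch (table, column
and host names are made up) of denormalizing two tables into one row per parent
with the DataFrame API, which runs on RDDs underneath.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

object DenormalizeTables {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("denormalize").getOrCreate()

    // Helper: load one MySQL table as a DataFrame through the JDBC source.
    def table(name: String) =
      spark.read.format("jdbc")
        .option("url", "jdbc:mysql://dbhost:3306/appdb")
        .option("dbtable", name)
        .option("user", "solr_etl")
        .option("password", "secret")
        .load()

    val docs = table("documents")       // parent entity: one Solr document per row
    val tags = table("document_tags")   // child entity: many rows per document

    // One row per document, with its tags collected into a multi-valued column,
    // i.e. the same shape a DIH sub-entity (zipper) join would produce.
    val merged = docs
      .join(tags, Seq("document_id"), "left_outer")
      .groupBy("document_id", "title")
      .agg(collect_list("tag").as("tags"))

    merged.show(5)   // from here, index with SolrJ or spark-solr
    spark.stop()
  }
}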

On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh 
wrote:

> How much data and what is the database source? Spark is probably the
> fastest way.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar ,
> wrote:
> > Hi,
> >
> > We are using DIH with SortedMapBackedCache, but as the data size increases
> > we need to provide more and more heap memory to the Solr JVM.
> > Can we use multiple CSV files instead of database queries, and later join
> > the data in the CSV files using zipper? The bottom line is to create a CSV
> > file for each entity in data-config.xml and join these CSV files using
> > zipper.
> > We also tried the EHCache-based DIH cache, but since EHCache uses MMap IO
> > it does not work well with MMapDirectoryFactory and ends up exhausting the
> > physical memory on the machine.
> > Please suggest how we can handle the use case of importing a huge amount
> > of data into Solr.
> >
> > --
> > Thanks,
> > Sujay P Bawaskar
> > M:+91-77091 53669
>



-- 
Thanks,
Sujay P Bawaskar
M:+91-77091 53669


Re: DIH with huge data

2018-04-12 Thread Rahul Singh
How much data and what is the database source? Spark is probably the fastest 
way.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar , wrote:
> Hi,
>
> We are using DIH with SortedMapBackedCache, but as the data size increases we
> need to provide more and more heap memory to the Solr JVM.
> Can we use multiple CSV files instead of database queries, and later join the
> data in the CSV files using zipper? The bottom line is to create a CSV file
> for each entity in data-config.xml and join these CSV files using zipper.
> We also tried the EHCache-based DIH cache, but since EHCache uses MMap IO it
> does not work well with MMapDirectoryFactory and ends up exhausting the
> physical memory on the machine.
> Please suggest how we can handle the use case of importing a huge amount of
> data into Solr.
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669


DIH with huge data

2018-04-12 Thread Sujay Bawaskar
Hi,

We are using DIH with SortedMapBackedCache, but as the data size increases we
need to provide more and more heap memory to the Solr JVM.
Can we use multiple CSV files instead of database queries, and later join the
data in the CSV files using zipper? The bottom line is to create a CSV file for
each entity in data-config.xml and join these CSV files using zipper.
We also tried the EHCache-based DIH cache, but since EHCache uses MMap IO it
does not work well with MMapDirectoryFactory and ends up exhausting the
physical memory on the machine.
Please suggest how we can handle the use case of importing a huge amount of
data into Solr.

-- 
Thanks,
Sujay P Bawaskar
M:+91-77091 53669