Re: Indexing excel (xlsx) file into SOLR 8.1.1

2019-07-26 Thread Charlie Hull
Simpler possibly, but not necessarily reliable. If you do everything 
inside Solr's DIH with Tika under the hood to extract data from Excel, a 
malformed Excel file could kill Tika and bring down your entire Solr 
cluster. Far better to do it outside of Solr as this blog describes: 
https://lucidworks.com/post/indexing-with-solrj/
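
For a large batch of files, the "outside of Solr" approach can be pictured as a client-side loop that handles each file on its own and records failures instead of stopping. The sketch below is only an illustration under stated assumptions: `extract_rows` and `send_to_solr` are hypothetical callables standing in for whatever extraction step (Tika, a spreadsheet library) and indexing client (SolrJ, plain HTTP) you choose; neither name comes from Solr or Tika.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Doc = Dict[str, str]


def index_files(
    paths: Iterable[str],
    extract_rows: Callable[[str], List[Doc]],   # hypothetical: parses one file into documents
    send_to_solr: Callable[[List[Doc]], None],  # hypothetical: submits documents to Solr
) -> Tuple[int, List[Tuple[str, str]]]:
    """Index each file independently; return (docs_sent, failures)."""
    sent = 0
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            docs = extract_rows(path)
            send_to_solr(docs)
            sent += len(docs)
        except Exception as exc:
            # A malformed file is recorded and skipped; the batch carries on.
            failures.append((path, str(exc)))
    return sent, failures
```

A parser that hangs or exhausts memory would need stronger isolation, such as running extraction in a separate process, which this sketch does not attempt.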


If you want to see what Tika does to your Excel examples this is quite a 
neat way to experiment: https://okfnlabs.org/projects/tika-server/
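
If you would rather script the experiment, a Tika server can also be called over HTTP. This is a minimal sketch, assuming a standard Apache Tika server is running locally on its default port 9998 and that its `/tika` endpoint accepts a PUT with the file as the request body; the function names are made up for illustration.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking the Tika server for a plain-text extraction."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send one file to the Tika server and return whatever text it extracts."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text` on one of your own spreadsheets and printing the result shows how Tika flattens the cells, which helps in judging whether the output is structured enough to map onto schema fields.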


Cheers

Charlie

On 26/07/2019 09:44, Vipul Bahuguna wrote:

Hi Charlie,

Thanks for your suggestion, but I will have thousands of these files
coming from different sources. It would become very tedious if I have to
first convert them to csv and then run line by line.

I was hoping there could be a simpler way to achieve this using DIH,
which I thought can be configured to read and ingest MS Excel (xlsx)
files.

I am not too sure what the configuration file would look like.

Any pointers are welcome. Thanks!

On Fri, 26 Jul, 2019, 1:56 PM Charlie Hull wrote:


Convert the Excel file to a CSV and then write a teeny script to go
through it line by line and submit to Solr over HTTP? Tika would
probably work but it's a lot of heavy lifting for what seems to me like
a simple problem.

Cheers

Charlie

On 26/07/2019 09:19, Vipul Bahuguna wrote:

Hi Guys - can anyone suggest how to achieve this?
I have understood how to insert json documents. So one alternative that
comes to my mind is that I can convert the rows in my excel to json format
with the header of my excel file becoming the json keys (corresponding to
the fields I have defined in my managed-schema.xml). And then each cell in
the excel file will become the value of this field.

However, I am sure there must be a better way of directly ingesting the
excel file to achieve the same. I was trying to read about DIH and Apache
Tika, but I am not very sure of how the configuration works.

My sample excel file has 4 columns namely -
1. First Name
2. Last Name
3. Phone
4. Website Link

I want to index these fields into SOLR in a way that all these columns
become my solr schema fields and later I can search based on these fields.

Any suggestions please.

thanks !


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk







Re: Indexing excel (xlsx) file into SOLR 8.1.1

2019-07-26 Thread Vipul Bahuguna
Hi Charlie,

Thanks for your suggestion, but I will have thousands of these files
coming from different sources. It would become very tedious if I have to
first convert them to csv and then run line by line.

I was hoping there could be a simpler way to achieve this using DIH,
which I thought can be configured to read and ingest MS Excel (xlsx)
files.

I am not too sure what the configuration file would look like.

Any pointers are welcome. Thanks!

On Fri, 26 Jul, 2019, 1:56 PM Charlie Hull wrote:

> Convert the Excel file to a CSV and then write a teeny script to go
> through it line by line and submit to Solr over HTTP? Tika would
> probably work but it's a lot of heavy lifting for what seems to me like
> a simple problem.
>
> Cheers
>
> Charlie
>
> On 26/07/2019 09:19, Vipul Bahuguna wrote:
> > Hi Guys - can anyone suggest how to achieve this?
> > I have understood how to insert json documents. So one alternative that
> > comes to my mind is that I can convert the rows in my excel to json format
> > with the header of my excel file becoming the json keys (corresponding to
> > the fields I have defined in my managed-schema.xml). And then each cell in
> > the excel file will become the value of this field.
> >
> > However, I am sure there must be a better way of directly ingesting the
> > excel file to achieve the same. I was trying to read about DIH and Apache
> > Tika, but I am not very sure of how the configuration works.
> >
> > My sample excel file has 4 columns namely -
> > 1. First Name
> > 2. Last Name
> > 3. Phone
> > 4. Website Link
> >
> > I want to index these fields into SOLR in a way that all these columns
> > become my solr schema fields and later I can search based on these fields.
> >
> > Any suggestions please.
> >
> > thanks !
> >
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>
>


Re: Indexing excel (xlsx) file into SOLR 8.1.1

2019-07-26 Thread Charlie Hull
Convert the Excel file to a CSV and then write a teeny script to go 
through it line by line and submit to Solr over HTTP? Tika would 
probably work but it's a lot of heavy lifting for what seems to me like 
a simple problem.
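
The "teeny script" could look roughly like the following. It is a minimal sketch, not a tested production loader: it assumes Solr is on its default port 8983, that the core is called `contacts` (a made-up name), and that the CSV header row already uses the field names defined in your schema. It reads the whole CSV and submits the rows as one JSON array to Solr's update handler rather than one request per line.

```python
import csv
import json
import urllib.request
from typing import Dict, List, TextIO

# Assumptions: Solr on its default port and a core named "contacts" (hypothetical).
SOLR_UPDATE_URL = "http://localhost:8983/solr/contacts/update?commit=true"


def read_docs(csv_file: TextIO) -> List[Dict[str, str]]:
    """Turn each CSV row into a dict keyed by the header row."""
    return [dict(row) for row in csv.DictReader(csv_file)]


def build_update_request(
    docs: List[Dict[str, str]], url: str = SOLR_UPDATE_URL
) -> urllib.request.Request:
    """Build a POST that sends the documents to Solr as a JSON array."""
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_csv(path: str, url: str = SOLR_UPDATE_URL) -> int:
    """Read one CSV file, submit its rows to Solr, and return the HTTP status."""
    with open(path, newline="", encoding="utf-8") as fh:
        docs = read_docs(fh)
    with urllib.request.urlopen(build_update_request(docs, url), timeout=60) as response:
        return response.status
```

For very large files it would make sense to send the rows in batches instead of a single request; that is left out here to keep the sketch short.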


Cheers

Charlie

On 26/07/2019 09:19, Vipul Bahuguna wrote:

Hi Guys - can anyone suggest how to achieve this?
I have understood how to insert json documents. So one alternative that
comes to my mind is that I can convert the rows in my excel to json format
with the header of my excel file becoming the json keys (corresponding to
the fields I have defined in my managed-schema.xml). And then each cell in
the excel file will become the value of this field.

However, I am sure there must be a better way of directly ingesting the
excel file to achieve the same. I was trying to read about DIH and Apache
Tika, but I am not very sure of how the configuration works.

My sample excel file has 4 columns namely -
1. First Name
2. Last Name
3. Phone
4. Website Link

I want to index these fields into SOLR in a way that all these columns
become my solr schema fields and later I can search based on these fields.

Any suggestions please.

thanks !



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Indexing excel (xlsx) file into SOLR 8.1.1

2019-07-26 Thread Vipul Bahuguna
Hi Guys - can anyone suggest how to achieve this?
I have understood how to insert json documents. So one alternative that
comes to my mind is that I can convert the rows in my excel to json format
with the header of my excel file becoming the json keys (corresponding to
the fields I have defined in my managed-schema.xml). And then each cell in
the excel file will become the value of this field.

However, I am sure there must be a better way of directly ingesting the
excel file to achieve the same. I was trying to read about DIH and Apache
Tika, but I am not very sure of how the configuration works.

My sample excel file has 4 columns namely -
1. First Name
2. Last Name
3. Phone
4. Website Link

I want to index these fields into SOLR in a way that all these columns
become my solr schema fields and later I can search based on these fields.
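
The header-to-field mapping described here can be sketched in a few lines. This is only an illustration: the snake_case naming (`first_name`, `website_link`) is an assumption about how the schema fields might be named, not something Solr requires, and the helper names are made up.

```python
import re
from typing import Dict, List


def to_field_name(header: str) -> str:
    """Lowercase a column header and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", header.strip().lower()).strip("_")


def row_to_doc(headers: List[str], cells: List[str]) -> Dict[str, str]:
    """Pair each header-derived field name with its cell, skipping empty cells."""
    return {to_field_name(h): c for h, c in zip(headers, cells) if c != ""}
```

With the four columns above, a row becomes a dict with the keys `first_name`, `last_name`, `phone` and `website_link`, which can be serialised to JSON and sent to Solr the same way as any other JSON document.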

Any suggestions please.

thanks !