Sangeetha,

You can also run Tika directly from the Data Import Handler, and the Data
Import Handler can be made to run several threads if you can partition the
input documents by directory or database id.   I've done 4 "threads" by
having a base configuration that runs an Oracle query like this:

      SELECT * FROM (SELECT id, url, ..., MOD(rowNum, 4) AS threadid FROM ...
WHERE ...) WHERE threadid = %d

A bash/sed script writes several data import handler XML files.
I can then index several threads at a time.
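That generation step can be sketched like this; the template file name, the
JDBC URL, and the table contents are my illustrative assumptions, not the
actual configs:

```shell
#!/bin/sh
# Sketch: generate one DIH config per "thread" from a shared template.
# All names and the JDBC URL here are illustrative.
cat > dih-template.xml <<'EOF'
<dataConfig>
  <dataSource driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@dbhost:1521:orcl"/>
  <document>
    <entity name="docs"
            query="SELECT * FROM (SELECT id, url, MOD(rowNum, 4) AS threadid
                   FROM docs) WHERE threadid = %d"/>
  </document>
</dataConfig>
EOF

# One config per partition; point a separate DIH request handler at each,
# then kick off all four full-imports in parallel.
for i in 0 1 2 3; do
  sed "s/%d/$i/" dih-template.xml > "dih-thread-$i.xml"
done
```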

Each of these threads can then use all of the transformers, e.g.
TemplateTransformer. XML can be transformed via XSLT.

The Data Import Handler also has entity processors that go out to the web
and then index the document via Tika.
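For reference, an entity of that kind looks roughly like this; the URL and
field names are illustrative, so check the DIH documentation for your Solr
version:

```xml
<dataConfig>
  <!-- Binary-over-HTTP data source feeding Tika -->
  <dataSource type="BinURLDataSource" name="bin"/>
  <document>
    <entity name="tika" processor="TikaEntityProcessor"
            url="http://example.com/some-doc.pdf"
            dataSource="bin" format="text">
      <field column="text" name="content"/>
    </entity>
  </document>
</dataConfig>
```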

If you are indexing generic HTML, you may want to figure out an approach to
SOLR-3808 and SOLR-2250. These can be resolved by recompiling Solr and Tika
locally: Boilerpipe has a bug that has been fixed, but the fix has not been
pushed to Maven Central.   Without that, the ASF cannot include the fix, but
distributions such as LucidWorks Solr Enterprise can.

I can drop some configs into github.com if I clean them up to obfuscate
host names, passwords, and such.


On Tue, Apr 7, 2015 at 9:14 AM, Yavar Husain <yavarhus...@gmail.com> wrote:

> Well, I have indexed heterogeneous sources including a variety of NoSQL
> stores, RDBMSs and rich documents (PDF, Word, etc.) using SolrJ. The only
> prerequisite for using SolrJ is that you should have an API to fetch data
> from your data source (say, JDBC for RDBMS, Tika for extracting text
> content from rich documents, etc.); then SolrJ is so damn great and
> simple. It's as simple as downloading the jar and a few lines of code to
> send data to your Solr server after pre-processing your data. More
> details here:
>
> http://lucidworks.com/blog/indexing-with-solrj/
>
> https://wiki.apache.org/solr/Solrj
>
> http://www.solrtutorial.com/solrj-tutorial.html
>
> Cheers,
> Yavar
>
>
>
> On Tue, Apr 7, 2015 at 4:18 PM, sangeetha.subraman...@gtnexus.com <
> sangeetha.subraman...@gtnexus.com> wrote:
>
> > Hi,
> >
> > I am a newbie to SOLR and basically from a database background. We have
> > a requirement to index files of different formats (X12, EDIFACT, CSV,
> > XML). The input files can be of any format, and we need to do a
> > content-based search on them.
> >
> > From the web I understand we can use the Tika processor to extract the
> > content and store it in SOLR. What I want to know is: is there any
> > better approach for indexing files in SOLR? Can we index the document
> > by streaming directly from the application? If so, what is the
> > disadvantage of using it (against DIH, which fetches from the
> > database)? Could someone share some insight on this? Are there any web
> > links I can refer to to get some idea on it? Please do help.
> >
> > Thanks
> > Sangeetha
> >
> >
>
