Re: Can Apache Solr Handle TeraByte Large Data

Alexandre Rafalovitch Mon, 03 Aug 2015 14:03:07 -0700

Well,

If it is just file names, I'd probably use the SolrJ client, maybe with
Java 8. Read the file names, split each name into parts with regular
expressions, put the parts into separate fields and send them to Solr.
Java 8 has file-tree walkers (e.g. Files.walk) that make this easier.
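
A minimal sketch of the walk-and-split step, assuming every name has five
underscore-separated tokens followed by ".pdf", as in your example. The
class name FileNameFields and the field names part1..part5 are
hypothetical, and using the file name as "id" is also an assumption (it
only works if names are unique across directories). The SolrJ call is
left out so the sketch runs with the JDK alone.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

final class FileNameFields {
    // Assumed layout: five underscore-separated tokens, then ".pdf".
    private static final Pattern NAME = Pattern.compile(
            "^([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_.]+)\\.pdf$");

    // Hypothetical field names; replace with the ones in your schema.
    private static final String[] FIELDS =
            {"part1", "part2", "part3", "part4", "part5"};

    // Returns the fields for one file name, or empty if the name does not match.
    static Optional<Map<String, String>> parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return Optional.empty();
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", fileName); // assumption: file names are unique
        for (int i = 0; i < FIELDS.length; i++) {
            doc.put(FIELDS[i], m.group(i + 1));
        }
        return Optional.of(doc);
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // Files.walk visits the whole tree lazily; close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .map(p -> parse(p.getFileName().toString()))
                 .filter(Optional::isPresent)
                 .map(Optional::get)
                 // With SolrJ, each map would be copied into a
                 // SolrInputDocument and sent in batches instead of printed.
                 .forEach(System.out::println);
        }
    }
}
```

Names that do not match the pattern are skipped rather than indexed with
missing parts; with 40 million files you would probably want to log them
instead.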

You could do it with DIH (DataImportHandler), but it would need nested
entities and the inner entity would probably try to parse the file. So, a
lot of wasted effort if you just care about the file names.

Or, I would just do a directory listing in the operating system and
use regular expressions to turn it into a CSV file, which I would then
import into Solr directly.
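
A rough sketch of the listing-to-CSV step, assuming the listing has one
file name per line and that each token contains only letters and digits
(so no CSV quoting is needed). The class name ListingToCsv and the column
names id, part1..part5 are hypothetical; the header row would have to
match the field names in your schema.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListingToCsv {
    // Hypothetical column names; the first column is used as the record ID.
    static final String HEADER = "id,part1,part2,part3,part4,part5";

    // Assumed layout: five alphanumeric tokens joined by "_", then ".pdf".
    private static final Pattern NAME = Pattern.compile(
            "^([A-Za-z0-9]+)_([A-Za-z0-9]+)_([A-Za-z0-9]+)"
            + "_([A-Za-z0-9]+)_([A-Za-z0-9]+)\\.pdf$");

    // Converts listing lines to CSV rows; lines that do not match are skipped.
    static List<String> toCsv(List<String> fileNames) {
        List<String> out = new ArrayList<>();
        out.add(HEADER);
        for (String raw : fileNames) {
            String name = raw.trim();
            Matcher m = NAME.matcher(name);
            if (!m.matches()) {
                continue;
            }
            StringBuilder row = new StringBuilder(name);
            for (int g = 1; g <= m.groupCount(); g++) {
                row.append(',').append(m.group(g));
            }
            out.add(row.toString());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Reads the directory listing from standard input, writes CSV to stdout.
        List<String> names = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                names.add(line);
            }
        }
        toCsv(names).forEach(System.out::println);
    }
}
```

The listing can be piped in from whatever the operating system provides,
and the resulting file posted to Solr's CSV update handler. For 40 million
names you would stream line by line rather than collect a list first; the
list is only there to keep the sketch short.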

In all of these cases, the question would be which field is the ID of
the record to ensure no duplicates.

Regards,
   Alex.

----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

On 3 August 2015 at 15:34, Mugeesh Husain <muge...@gmail.com> wrote:
> @Alexandre No, I don't need the content of the files. To restate my requirement:
>
> I have 40 million files stored in a file system,
> with file names like ARIA_SSN10_0007_LOCATION_0000129.pdf
>
> I only need to split the values out of each file name; those values are what I have to index.
>
> I am interested in indexing those values into Solr, not the file contents.
>
> I have tested DIH against a file system and it works fine, but I don't know how
> I can plug my own code into DIH:
> if my code extracts some values, how can I index them using DIH?
>
> If I use DIH, how do I perform the split operation and get the values from
> it?
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
> Sent from the Solr - User mailing list archive at Nabble.com.
