Re: using S3 as the Directory for Solr

Walter Underwood Thu, 23 Apr 2020 19:42:28 -0700

It will be a lot more than 2X or 3X slower. Years ago, I accidentally put Solr 
indexes on an NFS-mounted filesystem and it was 100X slower. S3 would be a lot 
slower than that.
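
A rough way to see why, as a minimal sketch: if a query needs many uncached
random reads that happen one after another, the per-read latency multiplies
straight through. Every number below (the per-read latencies, the read count)
and the names query_latency_ms and slowdown_table are assumptions for
illustration, not measurements. The NFS figure is simply chosen to reproduce
the roughly 100X slowdown mentioned above.

```python
# Back-of-envelope estimate. Every number here is an illustrative assumption,
# not a measurement from this thread.

def query_latency_ms(random_reads: int, per_read_ms: float) -> float:
    """Latency of one query if its index reads happen one after another."""
    return random_reads * per_read_ms

# Hypothetical per-read latencies in milliseconds.
ASSUMED_PER_READ_MS = {
    "local SSD": 0.1,
    "NFS": 10.0,     # chosen to match the ~100X slowdown described above
    "S3 GET": 50.0,  # assumed time-to-first-byte for one object request
}

READS_PER_QUERY = 200  # hypothetical count of uncached random reads per query

def slowdown_table(reads: int = READS_PER_QUERY) -> dict:
    """Slowdown factor of each store relative to the assumed local SSD."""
    base = query_latency_ms(reads, ASSUMED_PER_READ_MS["local SSD"])
    return {
        name: query_latency_ms(reads, ms) / base
        for name, ms in ASSUMED_PER_READ_MS.items()
    }

if __name__ == "__main__":
    for name, factor in slowdown_table().items():
        print(f"{name}: about {factor:.0f}x local disk")
```

The ratio does not depend on the read count, only on the per-read latency, so
caching helps only to the extent that it removes reads from the slow store.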


Are you doing relevance-ranked searches on all that data? That is the only 
reason to use Solr instead of some other solution.

I’d use Apache Hive, or whatever has replaced it. Facebook wrote Hive to run 
queries over their multi-petabyte logs.

https://hive.apache.org

More options.

https://jethro.io/hadoop-hive
https://mapr.com/why-hadoop/sql-hadoop/sql-hadoop-details/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 23, 2020, at 7:29 PM, Christopher Schultz 
> <ch...@christopherschultz.net> wrote:
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> Rahul,
> 
> On 4/23/20 21:49, dhurandar S wrote:
>> Thank you for your reply. The reason we are looking for S3 is since
>> the volume is close to 10 Petabytes. We are okay to have higher
>> latency of say twice or thrice that of placing data on the local
>> disk. But we have a requirement to have long-range data and
>> providing Search capability on that.  Every other storage apart from
>> S3 turned out to be very expensive at that scale.
>> 
>> Basically I want to replace
>> 
>> -Dsolr.directoryFactory=HdfsDirectoryFactory \
>> 
>> with S3 based implementation.
> 
> Can you clarify whether you have 10 PiB of /source data/ or 10 PiB of
> /index data/?
> 
> You can theoretically store your source data anywhere, of course. 10
> PiB sounds like a truly enormous index.
> 
> - -chris
> 
>> On Thu, Apr 23, 2020 at 3:12 AM Jan Høydahl <jan....@cominvent.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> Is your data so partitioned that it makes sense to consider
>>> splitting up in multiple collections and make some arrangement
>>> that will keep only a few collections live at a time, loading
>>> index files from S3 on demand?
>>> 
>>> I cannot see how an S3 directory would be able to effectively
>>> cache files from S3, or in what units the index files would be
>>> stored.
>>> 
>>> Have you investigated EFS as an alternative? That would look like
>>> a normal filesystem to Solr but might be cheaper storage wise,
>>> but much slower.
>>> 
>>> Jan
>>> 
>>>> 23. apr. 2020 kl. 06:57 skrev dhurandar S
>>>> <dhurandarg...@gmail.com>:
>>>> 
>>>> Hi,
>>>> 
>>>> I am looking to use S3 as the place to store indexes, just as
>>>> Solr uses HdfsDirectory to store the index and all the other
>>>> documents.
>>>> 
>>>> We want to provide a search capability that is okay to be a
>>>> little slow but cheaper in terms of the cost. We have close to
>>>> 2 petabytes of data on which we want to provide the Search
>>>> using Solr.
>>>> 
>>>> Are there any open-source implementations around using S3 as
>>>> the Directory for Solr?
>>>> 
>>>> Any recommendations on this approach?
>>>> 
>>>> regards, Rahul
>>> 
>>> 
>> 
> -----BEGIN PGP SIGNATURE-----
> Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
> 
> iQIyBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAl6iTwUACgkQHPApP6U8
> pFjRaw/4sGbH286gZJe+wfKsLc4JPvyJZjjwVDCdpiR2SHt50IA23wYSK97R6xRj
> dbWWReA7C3JNWp6x21i8Bb6sIeLDnotbc7IOSmOMuNep1BtVaYBMJ8wyW6uUtXf6
> hQbY0Ew93ZhDlS9CWMJqbQtWfrQEqH51Xbz+4uqqvJU8Bq9o9Vv0rnuVp/5f73lV
> ihek0sbA73oGle0gC5NFmrKItnn+14X8vIxUC8JRZlY4rDSiOdOcIil3DExxOQNQ
> UodIvwKKhzALFY77PeGSSjKiy0X3JJ1rKzLeIBrW0JCNMprYLzL2CQjZ5F09MraZ
> WxXdA64lEg2diEwHywNrsaaygbEZYTWd8gaeGA7kzCk78Y2KuhWuEQej6KmE3Iq2
> AW+K7JgFakUpzB5oorCtKNLQOqFHX85ne57gCYKr42S3Htfxmf98pBdudQy4RvuT
> +tJvGYx8NLqgeOoZN4u+G/8WunlzUC+u2vUxVcIoK3Ozz0usMioFDqn69vmOxxoH
> cN2Y4T1ZZZGtndiAGZww1JXKAbVN0U41isXg2F8tHQV9dxaeoYDQ/xYbAoWEhhlM
> SVtEdr76eMJ08T6h5711gtrhSK+RQFPD2Jbr8B/Xl063xPfN2TpqmcJCKXkucvpc
> CEDLFqeKX6qIRZDgMf8EICmbFl6aF5knbDP0MkyYk4urB+uFaw==
> =Y/6Y
> -----END PGP SIGNATURE-----
