Re: Fast write datastore...

Muthu Jayakumar Wed, 15 Mar 2017 20:21:08 -0700

>Reading your original question again, it seems to me probably you don't
need a fast data store
Shiva, you are right. I only asked about fast writes and never mentioned
reads :). For us, Cassandra may not be a good choice for reads because of
a. its limited pagination support on the server side
b. its less rich filters when compared to Elasticsearch... but this can be
worked around by using Spark DataFrames.
c. a possibly larger limitation for me, which is the mandate to define a
partition key column beforehand. I may not be able to determine this
beforehand.
But materialized views and SSTable Attached Secondary Indexes (SASI) can
help alleviate this to some extent.


>what performance do you expect from subsequent queries?
Uladzimir, here is what we do now...
Step 1: Run an aggregate query over a large number of Parquet files
(generally ranging from a few MBs to a few GBs) using Spark DataFrames.
Step 2: Attempt to store these query results in a 'fast datastore' (the one
I have asked for recommendations on in this post). The results usually range
from 250K to 600 million rows... Also, the schema from Step 1 is not known
beforehand and is usually deduced from the DataFrame schema. In most cases
the fields are simple and non-nested.
Step 3: Run one or more queries against the results stored in Step 2...
These are as simple as pagination, filters (think simple string contains,
regex, number in range, ...) and sort. For any operation more complex than
this, I plan to run it through a DataFrame.
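
To make the Step 3 query shapes concrete, here is a minimal sketch in plain
Python. It only illustrates the operations (string contains, regex, number
in range, sort, pagination) over in-memory rows; it is not code against
Cassandra, Elasticsearch or Spark. The function names (contains, matches,
in_range, query) and the sample field names are hypothetical.

```python
import re
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]
Predicate = Callable[[Row], bool]


def contains(field: str, needle: str) -> Predicate:
    """Keep rows whose field, rendered as a string, contains `needle`."""
    return lambda row: needle in str(row.get(field, ""))


def matches(field: str, pattern: str) -> Predicate:
    """Keep rows whose field, rendered as a string, matches the regex."""
    compiled = re.compile(pattern)
    return lambda row: compiled.search(str(row.get(field, ""))) is not None


def in_range(field: str, low: float, high: float) -> Predicate:
    """Keep rows whose numeric field lies in [low, high]; missing values fail."""
    def check(row: Row) -> bool:
        value = row.get(field)
        return value is not None and low <= value <= high
    return check


def query(rows: Iterable[Row], predicates: List[Predicate], sort_by: str,
          page: int, page_size: int, descending: bool = False) -> List[Row]:
    """Filter, sort, then return one zero-indexed page of rows.

    Assumes every surviving row has the `sort_by` field.
    """
    kept = [row for row in rows if all(p(row) for p in predicates)]
    kept.sort(key=lambda row: row[sort_by], reverse=descending)
    start = page * page_size
    return kept[start:start + page_size]


# Hypothetical sample of aggregated results from Step 1.
rows = [
    {"name": "alpha", "count": 5},
    {"name": "beta", "count": 12},
    {"name": "alphabet", "count": 9},
    {"name": "gamma", "count": 20},
]

first_page = query(rows, [contains("name", "alpha"), in_range("count", 1, 10)],
                   sort_by="count", page=0, page_size=1)
```

Anything beyond these shapes (joins, further aggregates) would go back
through a DataFrame, as described in Step 3.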

Koert makes valid points on the issues with Elasticsearch.

On a side note, we do use Cassandra for Spark Streaming use cases, where we
sink the data into Cassandra (for its efficient upsert capabilities) and
eventually write it to Parquet for long-term storage and for trend analysis
in full-table-scan scenarios.

But I am thankful for many ideas and perspectives on how this could be
looked at.

Thanks,
Muthu


On Wed, Mar 15, 2017 at 7:25 PM, Shiva Ramagopal <tr.s...@gmail.com> wrote:

> Hi,
>
> The choice of ES vs Cassandra should really be made depending on your
> query use-cases. ES and Cassandra have their own strengths which should be
> matched to what you want to do rather than making a choice based on their
> respective feature sets.
>
> Reading your original question again, it seems to me probably you don't
> need a fast data store since you are doing a batch-like processing (reading
> from Parquet files) and it is possible to control this part fully. And it
> also seems like you want to use ES. You can try to reduce the number of
> Spark executors to throttle the writes to ES.
>
> -Shiva
>
> On Wed, Mar 15, 2017 at 11:32 PM, Muthu Jayakumar <bablo...@gmail.com>
> wrote:
>
>> Hello Uladzimir / Shiva,
>>
>> From the ElasticSearch documentation (I have to see the logical plan of a
>> query to confirm), the richness of filters (like regex, ...) is pretty good
>> compared to Cassandra. As for aggregates, I think Spark DataFrames are
>> rich enough to tackle them.
>> Let me know your thoughts.
>>
>> Thanks,
>> Muthu
>>
>>
>> On Wed, Mar 15, 2017 at 10:55 AM, vvshvv <vvs...@gmail.com> wrote:
>>
>>> Hi muthu,
>>>
>>> I agree with Shiva, Cassandra also supports SASI indexes, which can
>>> partially replace Elasticsearch functionality.
>>>
>>> Regards,
>>> Uladzimir
>>>
>>> On Mar 15, 2017 5:57 PM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
>>>
>>> Probably Cassandra is a good choice if you are mainly looking for a
>>> datastore that supports fast writes. You can ingest the data into a table
>>> and define one or more materialized views on top of it to support your
>>> queries. Since you mention that your queries are going to be simple you can
>>> define your indexes in the materialized views according to how you want to
>>> query the data.
>>>
>>> Thanks,
>>> Shiva
>>>
>>>
>>>
>>> On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar <bablo...@gmail.com>
>>> wrote:
>>>
>>>> Hello Vincent,
>>>>
>>>> Cassandra may not fit my bill if I need to define my partition and
>>>> other indexes upfront. Is this right?
>>>>
>>>> Hello Richard,
>>>>
>>>> Let me evaluate Apache Ignite. I did evaluate it 3 months back and back
>>>> then the connector to Apache Spark did not support Spark 2.0.
>>>>
>>>> Another drastic thought may be to repartition the result down to 1
>>>> partition (but I have to be cautious to make sure I don't run into heap
>>>> issues if the result is too large to fit into an executor) and write to
>>>> a relational database like MySQL / Postgres. But I believe I can do the
>>>> same using ElasticSearch too.
>>>>
>>>> A slightly over-kill solution may be Spark to Kafka to ElasticSearch?
>>>>
>>>> More thoughts welcome please.
>>>>
>>>> Thanks,
>>>> Muthu
>>>>
>>>> On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling <
>>>> rsiebel...@gmail.com> wrote:
>>>>
>>>>> maybe Apache Ignite does fit your requirements
>>>>>
>>>>> On 15 March 2017 at 08:44, vincent gromakowski <
>>>>> vincent.gromakow...@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>> If the queries are static and the filters are on the same columns,
>>>>>> Cassandra is a good option.
>>>>>>
>>>>>> On 15 March 2017 7:04 AM, "muthu" <bablo...@gmail.com> wrote:
>>>>>>
>>>>>> Hello there,
>>>>>>
>>>>>> I have one or more Parquet files to read and perform some aggregate
>>>>>> queries on using Spark DataFrames. I would like to find a reasonably
>>>>>> fast datastore that allows me to write the results for subsequent
>>>>>> (simpler) queries.
>>>>>> I did attempt to use ElasticSearch to write the query results using the
>>>>>> ElasticSearch Hadoop connector. But I am running into connector write
>>>>>> issues if the number of Spark executors is too large for ElasticSearch
>>>>>> to handle.
>>>>>> But in the schema sense, this seems a great fit, as ElasticSearch has
>>>>>> smarts in place to discover the schema. Also, in the query sense, I can
>>>>>> perform simple filters and sorts using ElasticSearch, and for more
>>>>>> complex aggregates, Spark DataFrames can come back to the rescue :).
>>>>>> Please advise on other possible datastores I could use.
>>>>>>
>>>>>> Thanks,
>>>>>> Muthu
>>>>>>
