Hi Andrey,

Thanks a lot for your help. Unfortunately, I cannot use case classes, because the schema information is only available at runtime. Let me add some details to make this clearer.

Suppose I have a very big dataset (~500 TB) stored in AWS S3 in Parquet format. Using Spark, I can process it (filter + join) and reduce its size down to roughly 200-500 GB. I would like to save the resulting dataset in an Ignite cache using IgniteRDD and create indexes on a particular set of fields that will later be used for running queries (filter, join, aggregations). My assumption is that keeping this result dataset in Ignite with indexes would improve performance compared to using a persisted Spark DataFrame.

Unfortunately, the schema of the resulting dataset can vary in a great number of ways, so it seems impossible to describe all the variants with case classes. This is why an approach that stores spark.sql.Row and describes the query fields and indexes using QueryEntity would be preferable. Thanks to your explanation, I now see that this approach doesn't work.

Another solution that is spinning in my head is to generate case classes dynamically (at runtime) based on the Spark DataFrame schema, map the sql.Rows to an RDD[generated_case_class], describe the Ignite query and index fields using QueryEntity, and create an IgniteContext for the generated case class. I'm not sure this approach is even possible, so I would like to ask for your opinion before I go deeper. I would be very grateful for any advice.
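For reference, this is roughly how I was hoping to describe the query fields and indexes with QueryEntity (the field names and cache name here are hypothetical, derived at runtime from the DataFrame schema; it is a sketch, not working code, since as you explained the Row value type does not work):

```scala
import java.util.Collections

import org.apache.ignite.cache.{QueryEntity, QueryIndex}
import org.apache.ignite.configuration.CacheConfiguration
import org.apache.spark.sql.Row

// Hypothetical field set, built at runtime from the DataFrame schema.
val fields = new java.util.LinkedHashMap[String, String]()
fields.put("userId", "java.lang.Long")
fields.put("country", "java.lang.String")

// Describe the queryable fields and an index over "country".
val entity = new QueryEntity()
  .setKeyType("java.lang.Long")
  .setValueType(classOf[Row].getName) // the part that doesn't work
  .setFields(fields)
  .setIndexes(Collections.singletonList(new QueryIndex("country")))

// Cache configuration that would back the IgniteRDD.
val cacheCfg = new CacheConfiguration[java.lang.Long, Row]("resultCache")
  .setQueryEntities(Collections.singletonList(entity))
```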
Best regards,
Dmitry

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/index-and-query-org-apache-ignite-spark-IgniteRDD-String-org-apache-spark-sql-Row-tp3343p3363.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
