Hi,

Why not just use Dremel?

Regards,
Gourav Sengupta

On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev <[email protected]>
wrote:

> Hi,
>
> I'm trying to reproduce the example from the Dremel paper
> (https://research.google.com/pubs/archive/36632.pdf) in Apache Spark using
> pyspark, and I wonder whether it is possible at all.
>
> Trying to follow the paper's example as closely as possible, I created
> this document type:
>
> from pyspark.sql.types import *
>
> links_type = StructType([
>     StructField("Backward",
>                 ArrayType(IntegerType(), containsNull=False),
>                 nullable=False),
>     StructField("Forward",
>                 ArrayType(IntegerType(), containsNull=False),
>                 nullable=False),
> ])
>
> language_type = StructType([
>     StructField("Code", StringType(), nullable=False),
>     StructField("Country", StringType())
> ])
>
> names_type = StructType([
>     StructField("Language", ArrayType(language_type, containsNull=False)),
>     StructField("Url", StringType()),
> ])
>
> document_type = StructType([
>     StructField("DocId", LongType(), nullable=False),
>     StructField("Links", links_type, nullable=True),
>     StructField("Name", ArrayType(names_type, containsNull=False))
> ])
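>
> For completeness, this is roughly how I write the data (a minimal sketch;
> the record is my transcription of r1 from the paper, and the output path
> is just an example):
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("dremel-example").getOrCreate()
>
> # Approximately record r1 from the paper; Links.Backward is empty there.
> r1 = (
>     10,
>     ([], [20, 40, 60]),
>     [
>         ([("en-us", "us"), ("en", None)], "http://A"),
>         (None, "http://B"),
>         ([("en-gb", "gb")], None),
>     ],
> )
>
> df = spark.createDataFrame([r1], schema=document_type)
> df.write.mode("overwrite").parquet("/tmp/dremel_document.parquet")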
>
> But when I store data in parquet using this type, the resulting parquet
> schema is different from the one described in the paper:
>
> message spark_schema {
>   required int64 DocId;
>   optional group Links {
>     required group Backward (LIST) {
>       repeated group list {
>         required int32 element;
>       }
>     }
>     required group Forward (LIST) {
>       repeated group list {
>         required int32 element;
>       }
>     }
>   }
>   optional group Name (LIST) {
>     repeated group list {
>       required group element {
>         optional group Language (LIST) {
>           repeated group list {
>             required group element {
>               required binary Code (UTF8);
>               optional binary Country (UTF8);
>             }
>           }
>         }
>         optional binary Url (UTF8);
>       }
>     }
>   }
> }
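>
> The schema dump above can be reproduced with pyarrow, if it is available;
> the part file name below is just a placeholder:
>
> import pyarrow.parquet as pq
>
> # Spark writes a directory of part files; point at any one of them.
> path = "/tmp/dremel_document.parquet/part-00000-example.snappy.parquet"
> print(pq.ParquetFile(path).schema)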
>
> Moreover, if I create a parquet file with the schema described in the
> Dremel paper using the Apache Parquet Java API and try to read it into
> Apache Spark, I get an exception:
>
> org.apache.spark.sql.execution.QueryExecutionException: Encounter error
> while reading parquet files. One possible cause: Parquet column cannot be
> converted in the corresponding files
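>
> Would Spark's legacy Parquet list encoding get me any closer? I mean
> something like this (the option name comes from the Spark SQL
> configuration; I have not verified that it reproduces the paper's
> repeated fields):
>
> spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
> df.write.mode("overwrite").parquet("/tmp/dremel_document_legacy.parquet")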
>
> Is it possible to create the example schema described in the Dremel paper
> using Apache Spark, and what is the correct approach to building this
> example?
>
> Regards,
> Lubomir Chorbadjiev
>
