Hi, why not just use Dremel?
Regards,
Gourav Sengupta

On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev <[email protected]> wrote:
> Hi,
>
> I'm trying to reproduce the example from the Dremel paper
> (https://research.google.com/pubs/archive/36632.pdf) in Apache Spark using
> pyspark, and I wonder if it is possible at all.
>
> Trying to follow the paper's example as closely as possible, I created this
> document type:
>
>     from pyspark.sql.types import *
>
>     links_type = StructType([
>         StructField("Backward", ArrayType(IntegerType(), containsNull=False),
>                     nullable=False),
>         StructField("Forward", ArrayType(IntegerType(), containsNull=False),
>                     nullable=False),
>     ])
>
>     language_type = StructType([
>         StructField("Code", StringType(), nullable=False),
>         StructField("Country", StringType()),
>     ])
>
>     names_type = StructType([
>         StructField("Language", ArrayType(language_type, containsNull=False)),
>         StructField("Url", StringType()),
>     ])
>
>     document_type = StructType([
>         StructField("DocId", LongType(), nullable=False),
>         StructField("Links", links_type, nullable=True),
>         StructField("Name", ArrayType(names_type, containsNull=False)),
>     ])
>
> But when I store data in Parquet using this type, the resulting Parquet
> schema is different from the one described in the paper:
>
>     message spark_schema {
>       required int64 DocId;
>       optional group Links {
>         required group Backward (LIST) {
>           repeated group list {
>             required int32 element;
>           }
>         }
>         required group Forward (LIST) {
>           repeated group list {
>             required int32 element;
>           }
>         }
>       }
>       optional group Name (LIST) {
>         repeated group list {
>           required group element {
>             optional group Language (LIST) {
>               repeated group list {
>                 required group element {
>                   required binary Code (UTF8);
>                   optional binary Country (UTF8);
>                 }
>               }
>             }
>             optional binary Url (UTF8);
>           }
>         }
>       }
>     }
>
> Moreover, if I create a Parquet file with the schema described in the Dremel
> paper using the Apache Parquet Java API and try to read it into Apache
> Spark, I get an exception:
>     org.apache.spark.sql.execution.QueryExecutionException: Encounter error
>     while reading parquet files. One possible cause: Parquet column cannot
>     be converted in the corresponding files
>
> Is it possible to create the example schema described in the Dremel paper
> using Apache Spark, and what is the correct approach to build this example?
>
> Regards,
> Lubomir Chorbadjiev
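For reference, the record shredding the Dremel paper describes (repetition and definition levels) can be reproduced in plain Python, independently of how Spark or Parquet encode lists on disk. The sketch below is a simplified column striper for a single dotted column path; the `(name, mode)` path encoding and the `stripe` function are illustrative assumptions, not part of any Spark or Parquet API. The sample record is `r1` from the paper.

```python
# Simplified Dremel column striping for one column path.
# A path is a list of (field_name, mode) pairs, where mode is one of
# "required", "optional", "repeated" -- an illustrative encoding,
# not a Parquet/Spark API.

def stripe(record, path, r=0, d=0, rep=0, out=None):
    """Emit (value, repetition_level, definition_level) triples for one
    column, as in the Dremel paper.  `rep` counts the repeated ancestors
    consumed so far, so a field's own repetition level is rep + 1."""
    if out is None:
        out = []
    (name, mode), rest = path[0], path[1:]
    value = record.get(name) if isinstance(record, dict) else None
    if mode == "repeated":
        if not value:                       # missing or empty list -> NULL
            out.append((None, r, d))
        else:
            for i, item in enumerate(value):
                # second and later items repeat at this field's own level
                ri = rep + 1 if i > 0 else r
                if rest:
                    stripe(item, rest, ri, d + 1, rep + 1, out)
                else:
                    out.append((item, ri, d + 1))
    elif mode == "optional":
        if value is None:
            out.append((None, r, d))
        elif rest:
            stripe(value, rest, r, d + 1, rep, out)
        else:
            out.append((value, r, d + 1))
    else:                                   # required: no level change
        if rest:
            stripe(value, rest, r, d, rep, out)
        else:
            out.append((value, r, d))
    return out


# Record r1 from the Dremel paper.
r1 = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"},
                      {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}

code_path = [("Name", "repeated"), ("Language", "repeated"),
             ("Code", "required")]
print(stripe(r1, code_path))
# -> [('en-us', 0, 2), ('en', 2, 2), (None, 1, 1), ('en-gb', 1, 2)]
```

The printed triples match the `Name.Language.Code` column in Figure 3 of the paper, which is a useful sanity check independent of whichever physical list encoding the Parquet file ends up using.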
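On the schema mismatch itself: the `list`/`element` wrapper groups in the dumped schema come from the standard three-level LIST encoding that Spark writes, whereas the paper's schema uses bare `repeated` fields (the older two-level style). The settings below are real Spark SQL configuration keys, but whether they resolve this particular read error is an assumption to verify, and the legacy write format is still not byte-for-byte the paper's schema; this is a sketch, not a confirmed fix.

```python
# Sketch, assuming a working SparkSession: two Spark settings that
# affect how Parquet lists are written and read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write lists/maps in the pre-Spark-1.4 legacy encoding instead of the
# standard three-level LIST structure.  Note: this still differs from
# the Dremel paper's bare `repeated` groups.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

# The vectorized reader is stricter about physical layout; falling back
# to the parquet-mr reader sometimes avoids "Parquet column cannot be
# converted" errors when reading files written by other tools.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
```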