Kryo not registered class

2017-11-19 Thread Angel Francisco Orta
Hello, I'm using Spark 2.1.0 with Scala and I'm registering all classes with
Kryo, and I have a problem registering this class:

org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$SerializableFileStatus$SerializableBlockLocation[]

I can't register it with
classOf[Array[Class.forName("org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$SerializableFileStatus$SerializableBlockLocation").type]]


I have also tried creating a Java registrator class and registering the class
as
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$SerializableFileStatus$SerializableBlockLocation[].class;

Any clue is appreciated,

Thanks.
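
A minimal sketch of one possible workaround, assuming the component class is
looked up reflectively and its array class derived via java.lang.reflect.Array
(the class name is taken from the question above; everything else is
illustrative and untested against this exact Spark build):

import org.apache.spark.SparkConf

// Look up the private inner class by name, since it cannot be written as a
// source-level class literal, then derive its array class from a zero-length array.
val componentClass = Class.forName(
  "org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex" +
    "$SerializableFileStatus$SerializableBlockLocation")
val arrayClass = java.lang.reflect.Array.newInstance(componentClass, 0).getClass

// Register both classes with Kryo through the standard SparkConf hook.
val conf = new SparkConf()
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(componentClass, arrayClass))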


Re: Parquet file generated by Spark, but not compatible read by Hive

2017-06-12 Thread Angel Francisco Orta
Hello,

Do you use df.write, or do you insert with hiveContext.sql("insert into ...")?

Angel.
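
For reference, a rough sketch of the two write paths being contrasted
(illustrative only; the table and path names are placeholders, and the Hive
insert assumes dynamic partitioning is enabled):

// (a) Writing partitioned parquet directly with the DataFrameWriter.
df.write.partitionBy("brand").parquet("hdfs:///warehouse/dataset")

// (b) Inserting through the HiveContext so the data is written via Hive's table definition.
hiveContext.sql(
  "INSERT INTO TABLE tablename PARTITION (brand) SELECT col1, col2, brand FROM source_view")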

On Jun 12, 2017, 11:07 PM, "Yong Zhang"  wrote:

> We are using Spark *1.6.2* as ETL to generate parquet files for one
> dataset, partitioned by "brand" (a string column representing the brand in
> this dataset).
>
>
> After the partition folders (e.g. the "brand=a" folder) are generated in HDFS,
> we add the partitions in Hive.
>
>
> The Hive version is *1.2.1* (in fact, we are using HDP 2.5.0).
>
>
> Now the problem is that for 2 brand partitions, we cannot query the data
> generated in Spark, but it works fine for the rest of the partitions.
>
>
> Below is the error from the Hive CLI and hive.log when I query one of the bad
> partitions, e.g. "select * from tablename where brand='*BrandA*' limit 3;"
>
>
> Failed with exception java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable
>
>
> Caused by: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable
>   at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveWritableObject(ParquetStringInspector.java:52)
>   at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:222)
>   at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:307)
>   at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
>   at org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJSONSerDe.java:72)
>   at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
>   at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
>   at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:71)
>   at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:40)
>   at org.apache.hadoop.hive.ql.exec.ListSinkOperator.process(ListSinkOperator.java:90)
>   ... 22 more
>
> There is not much I can find by googling this error message, but it
> points to the schema in Hive being different from the one in the parquet file.
> But this is a very strange case, as the same schema works fine for the other
> brands, which are values of the same partition column and share the whole Hive
> schema above.
>
> If I query a good partition, e.g. "select * from tablename where brand='*BrandB*' limit
> 3;", everything works fine.
>
> So is this really caused by a mismatch between the Hive schema and the parquet
> files generated by Spark, by the data within the different partition keys, or
> by a compatibility issue between Spark and Hive?
>
> Thanks
>
> Yong
>
>
>
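
One way to narrow this down is to compare the schema Spark sees in the suspect
partition's files with what Hive expects (a sketch under assumed path and table
names, not something from the thread):

// Read the bad partition's parquet files directly and print the schema Spark
// infers; compare it field by field with `DESCRIBE tablename` in the Hive CLI.
val badPartition = sqlContext.read.parquet("hdfs:///warehouse/dataset/brand=BrandA")
badPartition.printSchema()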


Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Sorry, I had a typo; I meant repartition("fieldOfJoin").

On May 2, 2017, 9:44 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com>
wrote:

Hi Angel,

I am trying the code below, but I don't see the partitioning on the dataframe.

  val iftaGPSLocation_df = sqlContext.sql(iftaGPSLocQry)
  import sqlContext._
  import sqlContext.implicits._
  datapoint_prq_df.join(geoCacheLoc_df)

Val tableA = DfA.partitionby("joinField").filter("firstSegment")

The columns I have are Lat3, Lon3, VIN, and Time. Lat3 and Lon3 are my join
columns on both dataframes, and the rest are select columns.

Thanks,
Asmath
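
A sketch of what the repartition-before-join suggestion could look like with
the columns above (the column names Lat3/Lon3 and the two dataframes come from
this message; the partition count and everything else is an assumption):

import org.apache.spark.sql.functions.col

// Repartition both sides on the join keys so rows with the same (Lat3, Lon3)
// land in the same partitions, then join on those columns.
val left  = datapoint_prq_df.repartition(200, col("Lat3"), col("Lon3"))
val right = geoCacheLoc_df.repartition(200, col("Lat3"), col("Lon3"))
val joined = left.join(right, Seq("Lat3", "Lon3"))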



On Tue, May 2, 2017 at 1:38 PM, Angel Francisco Orta <
angel.francisco.o...@gmail.com> wrote:

> Have you tried partitioning by the join field and running the join in segments,
> filtering both tables to the same segment of data?
>
> Example:
>
> Val tableA = DfA.partitionby("joinField").filter("firstSegment")
> Val tableB= DfB.partitionby("joinField").filter("firstSegment")
>
> TableA.join(TableB)
>
On May 2, 2017, 8:30 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com>
> wrote:
>
>> Table 1 (192 GB) is partitioned by year and month ... 192 GB of data is
>> for one month, i.e. April
>>
>> Table 2: 92 GB, not partitioned.
>>
>> I have to perform a join on these tables now.
>>
>>
>>
>> On Tue, May 2, 2017 at 1:27 PM, Angel Francisco Orta <
>> angel.francisco.o...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Are the tables partitioned?
>>> If yes, what is the partition field?
>>>
>>> Thanks
>>>
>>>
>>> On May 2, 2017, 8:22 PM, "KhajaAsmath Mohammed" <
>>> mdkhajaasm...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I am trying to join two big tables in Spark and the job is running for
>>> quite a long time without any results.
>>>
>>> Table 1: 192GB
>>> Table 2: 92 GB
>>>
>>> Does anyone have a better solution to get the results faster?
>>>
>>> Thanks,
>>> Asmath
>>>
>>>
>>>
>>


Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Have you tried partitioning by the join field and running the join in segments,
filtering both tables to the same segment of data?

Example:

Val tableA = DfA.partitionby("joinField").filter("firstSegment")
Val tableB= DfB.partitionby("joinField").filter("firstSegment")

TableA.join(TableB)
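
A runnable version of that idea might look roughly like this (an illustrative
sketch: the actual DataFrame method is repartition, and dfA/dfB together with
the segment predicate on the join key are placeholders):

import org.apache.spark.sql.functions.col

// "Join by segments": restrict both sides to the same slice of the join key,
// repartition on that key so matching rows are co-located, then join the slice.
val segment = col("joinField").between(0, 999)
val tableA = dfA.filter(segment).repartition(200, col("joinField"))
val tableB = dfB.filter(segment).repartition(200, col("joinField"))
val joinedSegment = tableA.join(tableB, Seq("joinField"))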

On May 2, 2017, 8:30 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com>
wrote:

> Table 1 (192 GB) is partitioned by year and month ... 192 GB of data is
> for one month, i.e. April
>
> Table 2: 92 GB, not partitioned.
>
> I have to perform a join on these tables now.
>
>
>
> On Tue, May 2, 2017 at 1:27 PM, Angel Francisco Orta <
> angel.francisco.o...@gmail.com> wrote:
>
>> Hello,
>>
>> Are the tables partitioned?
>> If yes, what is the partition field?
>>
>> Thanks
>>
>>
>> On May 2, 2017, 8:22 PM, "KhajaAsmath Mohammed" <
>> mdkhajaasm...@gmail.com> wrote:
>>
>> Hi,
>>
>> I am trying to join two big tables in Spark and the job is running for
>> quite a long time without any results.
>>
>> Table 1: 192GB
>> Table 2: 92 GB
>>
>> Does anyone have a better solution to get the results faster?
>>
>> Thanks,
>> Asmath
>>
>>
>>
>


Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Hello,

Are the tables partitioned?
If yes, what is the partition field?

Thanks


On May 2, 2017, 8:22 PM, "KhajaAsmath Mohammed"  wrote:

Hi,

I am trying to join two big tables in Spark and the job is running for
quite a long time without any results.

Table 1: 192GB
Table 2: 92 GB

Does anyone have a better solution to get the results faster?

Thanks,
Asmath