Re: SparkSQL LEFT JOIN problem

2014-10-13 Thread invkrh
Hi,

Thank you, Liquan. I left out some information in my previous post.

I just solved the problem.

Actually, I use the first line (the schema header) of the CSV file to
generate the StructType and StructFields. However, the input file is UTF-8
Unicode (*with* BOM), so the first char of the file is #65279 (U+FEFF).

As a result, the first field name has a leading #65279 char. When querying,
I used just account_id, so SparkSQL could not find that field in the AST,
where the name is actually #65279account_id.
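
A minimal sketch of how this happens, in plain Python rather than the Scala
job from this thread. The file content and the parse_header helper are
hypothetical and only illustrate the effect: decoding BOM-prefixed bytes as
plain UTF-8 keeps U+FEFF at the front of the first field name.

```python
import codecs


def parse_header(raw: bytes) -> list:
    """Decode the first line of a CSV as plain UTF-8 and split it into field names."""
    first_line = raw.split(b"\n", 1)[0]
    return first_line.decode("utf-8").rstrip("\r").split(",")


# Hypothetical file content: a UTF-8 BOM, then the schema header and one row.
raw = codecs.BOM_UTF8 + b"account_id,birthday\n50666,1984-02-23\n"

fields = parse_header(raw)
print(ascii(fields[0]))           # escaped form makes the hidden char visible
print(fields[0] == "account_id")  # False: the name carries a leading U+FEFF
```

Any schema built from these names would register the first column under the
BOM-prefixed name, so a query on account_id has nothing to resolve against.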

So the solution is to convert the input file to UTF-8 Unicode (*without*
BOM), which removes the leading #65279. Everything is fine now.
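
One possible way to do the conversion, sketched in Python under the
assumption that the file fits in memory. The strip_utf8_bom helper and the
file names are hypothetical; the "utf-8-sig" codec is the standard Python
codec that drops a leading BOM on read if one is present.

```python
import codecs
import os
import tempfile


def strip_utf8_bom(src_path: str, dst_path: str) -> None:
    """Rewrite src_path as UTF-8 without a BOM; a file with no BOM is copied as-is."""
    # newline="" disables newline translation so line endings are preserved.
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo on a hypothetical BOM-prefixed CSV written to a temp directory.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "customer_bom.csv")
dst = os.path.join(tmp_dir, "customer.csv")
with open(src, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"account_id,birthday\n50666,1984-02-23\n")

strip_utf8_bom(src, dst)
with open(dst, "rb") as f:
    cleaned = f.read()
print(cleaned.startswith(codecs.BOM_UTF8))  # False
```

An alternative is to leave the file alone and strip a leading U+FEFF from
the first header field when building the schema.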

As #65279 is not printable, the bug was not easy to find, especially since
the error msg made me think it was SparkSQL's problem.
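
A small check like the following can make such a character visible. This is
a sketch in Python, and both helper names are hypothetical; it relies only
on codecs.BOM_UTF8 and str.isprintable from the standard library.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """True if the raw bytes begin with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def non_printable_chars(name: str) -> list:
    """Return (index, code point) pairs for chars that str.isprintable() rejects."""
    return [(i, "U+%04X" % ord(ch)) for i, ch in enumerate(name) if not ch.isprintable()]


# Hypothetical header bytes and field name matching the case described here.
print(starts_with_utf8_bom(codecs.BOM_UTF8 + b"account_id,birthday"))
print(non_printable_chars("\ufeffaccount_id"))
```

Running the second helper over every field name before registering the
table would flag the BOM-prefixed column right away.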

I really hope the exception msgs of SparkSQL could be a little more
explicit for developers.

Regards,

Hao




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-LEFT-JOIN-problem-tp16152p16277.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL LEFT JOIN problem

2014-10-10 Thread Liquan Pei
Hi

Can you try
select birthday from customer left join profile on customer.account_id =
profile.account_id
to see if the problem remains on your entire data?

Thanks,
Liquan

On Fri, Oct 10, 2014 at 8:20 AM, invkrh wrote:

> Hi,
>
> I am exploring SparkSQL 1.1.0 and I have a problem with LEFT JOIN.
>
> Here is the query:
>
> select * from customer left join profile on customer.account_id =
> profile.account_id
>
> The two tables' schemas are as follows:
>
> // Table: customer
> root
>  |-- account_id: string (nullable = false)
>  |-- birthday: string (nullable = true)
>  |-- preferstore: string (nullable = true)
>  |-- registstore: string (nullable = true)
>  |-- gender: string (nullable = true)
>  |-- city_name_en: string (nullable = true)
>  |-- register_date: string (nullable = true)
>  |-- zip: string (nullable = true)
>
> // Table: profile
> root
>  |-- account_id: string (nullable = false)
>  |-- card_type: string (nullable = true)
>  |-- card_upgrade_time_black: string (nullable = true)
>  |-- card_upgrade_time_gold: string (nullable = true)
>
> However, I always get an exception:
>
> Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
> Project [*]
>  Join LeftOuter, Some(('customer.account_id = 'profile.account_id))
>   Subquery customer
>    SparkLogicalPlan (ExistingRdd [account_id#0,birthday#1,preferstore#2,registstore#3,gender#4,city_name_en#5,register_date#6,zip#7], MappedRDD[5] at map at SQLFetcher.scala:43)
>   Subquery profile
>    SparkLogicalPlan (ExistingRdd [account_id#8,card_type#9,card_upgrade_time_black#10,card_upgrade_time_gold#11], MappedRDD[12] at map at SQLFetcher.scala:43)
>
> I was not sure where the problem was, so I created two simple tables to
> isolate it.
>
> // table 1
> a   b   c
> 4   8   9
> 1   3   4
> 3   4   5
>
> // table 2
> a   b   c
> 1   2   3
> 4   5   6
>
> This time, it works.
>
> So the problem might be in the data. I then sampled some lines of the
> input tables to create new ones.
> This also works.
>
> I am so confused. The problem seems to be in the data, but the error
> messages are not enough to find it (if I am not missing anything).
>
> Here are some lines of the sampled tables:
>
> // Table: customer
>
> [50660,1975-06-05 00:00:00.000,13,12,male,ningboshi,2006-12-14 00:00:00.000,]
> [50666,1984-02-23 00:00:00.000,72,5,Female,beijingshi,2006-12-14 00:00:00.000,100086]
> [50680,1976-11-25 00:00:00.000,59,5,Female,beijingshi,2006-12-14 00:00:00.000,100022]
> [85,1971-03-27 00:00:00.000,2,2,Female,shanghaishi,2005-09-20 00:00:00.000,200336]
>
>
> // Table: profile
>
> [1144681,3,2010-02-18 00:00:00.000,2013-02-28 00:00:00.000]
> [50666,2,2010-10-31 00:00:00.000,]
> [3930657,1,,]
> [1056365,2,2009-12-29 00:00:00.000,]
>
> Any help ? =)
>
> Hao
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-LEFT-JOIN-problem-tp16152.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst