Hi John,

I tried to follow your description but failed to reproduce this issue. Would you mind providing some more details? In particular:
- Exact Git commit hash of the snapshot version you were using

  Mine: e0f946265b9ea5bc48849cf7794c2c03d5e29fba
  <https://github.com/apache/spark/commit/e0f946265b9ea5bc48849cf7794c2c03d5e29fba>

- Compilation flags (Hadoop version, profiles enabled, etc.)

  Mine: ./sbt/sbt -Pyarn,kinesis-asl,hive,hadoop-2.3 -Dhadoop.version=2.3.0 clean assembly/assembly

- Also, it would be great if you could provide the schema of your table plus some sample data that can help reproduce this issue.

Cheng

On Wed, Aug 20, 2014 at 6:11 AM, John Omernik <j...@omernik.com> wrote:

> I am working with Spark SQL and the Thrift server. I ran into an
> interesting bug, and I am curious about what information/testing I can
> provide to help narrow things down.
>
> My setup is as follows:
>
> Hive 0.12 with a table that has lots of columns (50+) stored as rcfile.
> Spark-1.1.0-SNAPSHOT with Hive built in (and Thrift Server)
>
> My query is only selecting one STRING column from the data, but only
> returning data based on other columns.
>
> Types:
> col1 = STRING
> col2 = STRING
> col3 = STRING
> col4 = Partition Field (TYPE STRING)
>
> Queries
> cache table table1;
> --Run some other queries on other data
> select col1 from table1
> where col2 = 'foo' and col3 = 'bar' and col4 = 'foobar' and col1 is not null limit 100
>
> Fairly simple query.
>
> When I run this in SQL Squirrel I get no results. When I remove the
> 'and col1 is not null' I get 100 rows of <null>
>
> When I run this in beeline (the one that is in the spark-1.1.0-SNAPSHOT)
> I get no results, and when I remove 'and col1 is not null' I get 100 rows
> of <null>
>
> Note: Both of these are after I ran some other queries, i.e. on other
> columns, after I ran CACHE TABLE TABLE1 first before any queries. That
> seemed interesting to me...
>
> So I went to the spark-shell to determine if it was a Spark issue or a
> Thrift issue.
>
> I ran:
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> import hiveContext._
> cacheTable("table1")
>
> Then I ran the same "other" queries, got results, and then I ran the
> query above, and I got results as expected.
>
> Interestingly enough, if I don't cache the table through 'cache table
> table1' in Thrift, I get results for all queries. If I uncache, I start
> getting results again.
>
> I hope I was clear enough here; I am happy to help however I can.
>
> John