Spark SQL StructType error

2016-04-16 Thread nsalian
Hello, I am parsing a text file and inserting the parsed values into a Hive table. Code: files = sc.wholeTextFiles("hdfs://nameservice1:8020/user/root/email.txt", minPartitions=16, use_unicode=True) # Putting unicode to False didn't help either sqlContext.sql("DROP TABLE emails") sqlContext.sq
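The message is cut off before the actual StructType error, but a common pitfall here is that `wholeTextFiles` yields `(path, content)` pairs, so the content string must be parsed into row tuples whose arity and types match the StructType schema before calling `createDataFrame`. A minimal pure-Python sketch of that parsing step (the header names and fields are assumptions, not from the original post):

```python
# Hypothetical parsing step for (path, content) pairs from wholeTextFiles.
# The 'From'/'Subject' field names are illustrative assumptions.

def parse_email(content):
    """Parse 'Header: value' lines into a (sender, subject) tuple."""
    headers = {}
    for line in content.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            headers[key.strip().lower()] = value.strip()
    return (headers.get("from", ""), headers.get("subject", ""))

sample = "From: alice@example.com\nSubject: quarterly report\n\nbody text"
row = parse_email(sample)
print(row)  # ('alice@example.com', 'quarterly report')
```

In PySpark this would feed something like `sqlContext.createDataFrame(files.map(lambda kv: parse_email(kv[1])), schema)`, where `schema` is a `StructType` with one `StringType` field per tuple element; a mismatch between tuple length and schema fields is a frequent source of StructType errors.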

Re: Apache Flink

2016-04-16 Thread Ascot Moss
I compared both last month; it seems to me that Flink's ML library is not yet ready. On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh wrote: > Thanks Ted. I was wondering if someone is using both :) > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrb

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
Thanks Josh. I downloaded spark-1.6.1-bin-hadoop2.3.tgz and spark-1.6.1-bin-hadoop2.4.tgz which expand without error. On Sat, Apr 16, 2016 at 4:54 PM, Josh Rosen wrote: > Using a different machine / toolchain, I've downloaded and re-uploaded all > of the 1.6.1 artifacts to that S3 bucket, so ho

Reading conf file in Pyspark in cluster mode

2016-04-16 Thread Bijay Kumar Pathak
Hello, I have Spark jobs packaged in a zip and deployed using cluster mode in AWS EMR. The job has to read a conf file packaged with the zip under the resources directory. I can read the conf file in client mode but not in cluster mode. How do I read the conf file packaged in the zip while dep

Re: A number of issues when running spark-ec2

2016-04-16 Thread Josh Rosen
Using a different machine / toolchain, I've downloaded and re-uploaded all of the 1.6.1 artifacts to that S3 bucket, so hopefully everything should be working now. Let me know if you still encounter any problems with unarchiving. On Sat, Apr 16, 2016 at 3:10 PM Ted Yu wrote: > Pardon me - there

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
Pardon me - there is no tarball for hadoop 2.7 I downloaded https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz and successfully expanded it. FYI On Sat, Apr 16, 2016 at 2:52 PM, Jon Gregg wrote: > That link points to hadoop2.6.tgz. I tried changing the URL to > http

Re: A number of issues when running spark-ec2

2016-04-16 Thread Jon Gregg
That link points to hadoop2.6.tgz. I tried changing the URL to https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz and I get a NoSuchKey error. Should I just go with it even though it says hadoop2.6? On Sat, Apr 16, 2016 at 5:37 PM, Ted Yu wrote: > BTW this was the or

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
BTW this was the original thread: http://search-hadoop.com/m/q3RTt0Oxul0W6Ak The link for spark-1.6.1-bin-hadoop2.7 is https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz On Sat, Apr 16, 201

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
From the output you posted: --- Unpacking Spark gzip: stdin: not in gzip format tar: Child returned status 1 tar: Error is not recoverable: exiting now --- The artifact for spark-1.6.1-bin-hadoop2.6 is corrupt. This problem has been reported in other threads. Try spark-1.6.1-bin-hadoop
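The "not in gzip format" failure can be reproduced and diagnosed locally: gzip streams begin with the magic bytes 0x1f 0x8b, so a truncated download or an HTML error page saved as `.tgz` fails immediately. A small sketch (the "403 Forbidden" payload is an illustrative assumption of what a failed S3 download can contain):

```python
# Check whether downloaded bytes are actually a gzip stream before untarring.
import gzip

def looks_like_gzip(data):
    """True if `data` begins with the gzip magic number (0x1f 0x8b)."""
    return data[:2] == b"\x1f\x8b"

good = gzip.compress(b"spark tarball contents")
bad = b"<html>403 Forbidden</html>"  # e.g. an error page saved as .tgz

print(looks_like_gzip(good))  # True
print(looks_like_gzip(bad))   # False

# gzip.decompress raises on non-gzip input, mirroring tar's failure above.
try:
    gzip.decompress(bad)
except gzip.BadGzipFile:
    print("not in gzip format")
```

(`gzip.BadGzipFile` requires Python 3.8+; on older versions catch `OSError` instead.) Comparing the file's checksum against the published one is the more definitive check when available.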

A number of issues when running spark-ec2

2016-04-16 Thread YaoPau
I launched a cluster with: "./spark-ec2 --key-pair my_pem --identity-file ../../ssh/my_pem.pem launch jg_spark2" and I got the "Spark standalone cluster started at http://ec2-54-88-249-255.compute-1.amazonaws.com:8080" and "Done!" success confirmations at the end. I confirmed on EC2 that 1 Master

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-16 Thread Ted Yu
Kevin: Can you describe how you got over the Metadata fetch exception? > On Apr 16, 2016, at 9:41 AM, Kevin Eid wrote: > > One last email to announce that I've fixed all of the issues. Don't hesitate > to contact me if you encounter the same. I'd be happy to help. > > Regards, > Kevin > >> O

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-16 Thread Kevin Eid
One last email to announce that I've fixed all of the issues. Don't hesitate to contact me if you encounter the same. I'd be happy to help. Regards, Kevin On 14 Apr 2016 12:39 p.m., "Kevin Eid" wrote: > Hi all, > > I managed to copy my .py files from local to the cluster using SCP. And I > mana

Spark streaming batch time displayed is not current system time but it is processing current messages

2016-04-16 Thread Hemalatha A
Can anyone help me debug this issue, please? On Thu, Apr 14, 2016 at 12:24 PM, Hemalatha A < hemalatha.amru...@googlemail.com> wrote: > Hi, > > I am facing a problem in Spark streaming. > Time: 1460823006000 ms --- ---

Re: Apache Flink

2016-04-16 Thread Mich Talebzadeh
Thanks Ted. I was wondering if someone is using both :) Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com O

Re: Apache Flink

2016-04-16 Thread Ted Yu
Looks like this question is more relevant on flink mailing list :-) On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh wrote: > Hi, > > Has anyone used Apache Flink instead of Spark by any chance > > I am interested in its set of libraries for Complex Event Processing. > > Frankly I don't know if

Apache Flink

2016-04-16 Thread Mich Talebzadeh
Hi, Has anyone used Apache Flink instead of Spark, by any chance? I am interested in its set of libraries for Complex Event Processing. Frankly I don't know if it offers far more than Spark offers. Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxian

Re: ERROR [main] client.ConnectionManager$HConnectionImplementation: The node /hbase is not in ZooKeeper.

2016-04-16 Thread Ted Yu
Please send this query to user@hbase. This is the default value: zookeeper.znode.parent = /hbase. It looks like the hbase-site.xml accessible to your client didn't have an up-to-date value for zookeeper.znode.parent. Please make sure an hbase-site.xml with the proper config is on the classpath. On Sat, Apr 16, 20
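A quick way to verify what the client actually sees is to parse the `hbase-site.xml` on the classpath and check the property. Sketch below, using the standard Hadoop `<configuration><property>` layout; the inline XML is a made-up example, so point `ElementTree` at your real file:

```python
# Check zookeeper.znode.parent in an hbase-site.xml; falls back to the
# documented default of /hbase when the property is absent.
import xml.etree.ElementTree as ET

SITE_XML = """<configuration>
  <property>
    <name>zookeeper.znode.parent</name>
    <value>/hbase</value>
  </property>
</configuration>"""

def znode_parent(xml_text, default="/hbase"):
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "zookeeper.znode.parent":
            return prop.findtext("value")
    return default

print(znode_parent(SITE_XML))  # /hbase
```

If the value printed here differs from the znode your ZooKeeper ensemble actually hosts (some distributions use `/hbase-unsecure` or `/hbase-secure`), that mismatch produces exactly the "node /hbase is not in ZooKeeper" error.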

Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Mich Talebzadeh
Apologies, that should read *desc formatted *. Example for table dummy:
hive> desc formatted dummy;
OK
# col_name       data_type   comment
id               int
clustered        int
scattered        int
randomised       int
random_string    var

Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
Hi, I will put the date in the correct format in the future and see if that changes anything. The query that I sent is just an example of one possible aggregation; I have a lot of them possible on the same table, so I am not sure that sorting for all of them could actually have an impact. I am usi

Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
Hi, I have 17970737 rows. I tried to do a “desc formatted statistics myTable” but I get “Error while compiling statement: FAILED: SemanticException [Error 10001]: Table not found statistics”, even after doing something like: “ANALYZE TABLE myTable COMPUTE STATISTICS FOR COLUMNS". Thank you for

Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Jörn Franke
Generally a recommendation (besides the issue): do not store dates as Strings. I recommend making them ints; it will be much faster in both cases. It could be that you load them differently in the tables. Generally, for these tables you should insert them in both cases sorted into the tables
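The dates-as-ints suggestion can be sketched in plain Python: storing `event_date` as an int like 20160106 instead of the string '2016-01-06' preserves the same ordering, so range predicates like the one in the original query keep working while comparisons become cheap integer compares. A minimal sketch:

```python
# Convert an ISO date string to a sortable yyyymmdd integer.
def date_to_int(date_str):
    """'2016-01-06' -> 20160106"""
    y, m, d = date_str.split("-")
    return int(y) * 10000 + int(m) * 100 + int(d)

lo, hi = date_to_int("2016-01-06"), date_to_int("2016-04-02")
print(lo, hi)                                 # 20160106 20160402
print(lo <= date_to_int("2016-02-15") <= hi)  # True
```

The same range filter then becomes `WHERE event_date >= 20160106 AND event_date <= 20160402` against an int column, in both the ORC and Parquet tables.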

Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Mich Talebzadeh
Have you analysed statistics on the ORC table? How many rows are there? Also send the output of desc formatted statistics HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
Hi, I am executing one query : “SELECT `event_date` as `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2” My table was created something like : CREATE TA