Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
> > But it's really weird to be setting SPARK_HOME in the environment of > your node managers. YARN shouldn't need to know about that. > On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang > wrote: > > > > > https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
so it does not get > expanded by the shell). > > But it's really weird to be setting SPARK_HOME in the environment of > your node managers. YARN shouldn't need to know about that. > On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang > wrote: > > > > > https://github.c

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
weird to be setting SPARK_HOME in the environment of >> your node managers. YARN shouldn't need to know about that. >> On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang >> wrote: >> > >> > >> https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
d from your gateway machine to YARN by > default. > > You probably have some configuration (in spark-defaults.conf) that > tells YARN to use a cached copy. Get rid of that configuration, and > you can use whatever version you like. > On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang > wro

Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
', 'FAIR') > ,('spark.shuffle.service.enabled', 'true') > ,('spark.dynamicAllocation.enabled', 'true') > ]) > py_files = > ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip'] > sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client", > conf=sparkConf, pyFiles=py_files) > > Thanks, -- Jianshi Huang

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
- Are all files readable by the user running the history server? > - Did all applications call sc.stop() correctly (i.e. files do not have > the ".inprogress" suffix)? > > Other than that, always look at the logs first, looking for any errors > that may be thrown. > >

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
BTW, is there an option to set file permission for spark event logs? Jianshi On Thu, May 28, 2015 at 11:25 AM, Jianshi Huang wrote: > Hmm...all files under the event log folder has permission 770 but > strangely my account cannot read other user's files. Permission denied. > >

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
> On Wed, May 27, 2015 at 5:33 AM, Jianshi Huang > wrote: > >> No one using History server? :) >> >> Am I the only one need to see all user's logs? >> >> Jianshi >> >> On Thu, May 21, 2015 at 1:29 PM, Jianshi Huang >> wrote: >

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
No one using History server? :) Am I the only one need to see all user's logs? Jianshi On Thu, May 21, 2015 at 1:29 PM, Jianshi Huang wrote: > Hi, > > I'm using Spark 1.4.0-rc1 and I'm using default settings for history > server. > > But I can only see my own

View all user's application logs in history server

2015-05-20 Thread Jianshi Huang
Hi, I'm using Spark 1.4.0-rc1 and I'm using default settings for history server. But I can only see my own logs. Is it possible to view all user's logs? The permission is fine for the user group. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Why so slow

2015-05-12 Thread Jianshi Huang
>= 2014-04-30)) PhysicalRDD [meta#143,nvar#145,date#147], MapPartitionsRDD[6] at explain at :32 Jianshi On Tue, May 12, 2015 at 10:34 PM, Olivier Girardot wrote: > can you post the explain too ? > > On Tue, 12 May 2015 at 12:11, Jianshi Huang > wrote: > >> Hi,

Why so slow

2015-05-12 Thread Jianshi Huang
s like https://issues.apache.org/jira/browse/SPARK-5446 is still open, when can we have it fixed? :) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread Jianshi Huang
I'm using the default settings. Jianshi On Wed, May 6, 2015 at 7:05 PM, twinkle sachdeva wrote: > Hi, > > Can you please share your compression etc settings, which you are using. > > Thanks, > Twinkle > > On Wed, May 6, 2015 at 4:15 PM, Jianshi Huang > wrot

FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread Jianshi Huang
I'm facing this error in Spark 1.3.1 https://issues.apache.org/jira/browse/SPARK-4105 Anyone knows what's the workaround? Change the compression codec for shuffle output? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
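For reference, a minimal sketch of the codec workaround the question floats (the specific codec chosen here is an assumption, not something confirmed in the thread):

    // Switch the I/O compression codec used for shuffle output away from the default.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("shuffle-codec-workaround")
      .set("spark.io.compression.codec", "lz4")   // alternatives: "lzf", "snappy"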

Re: Parquet error reading data that contains array of structs

2015-04-27 Thread Jianshi Huang
>> Fix Version of SPARK-4520 is not set. >> I assume it was fixed in 1.3.0 >> >> Cheers >> Fix Version >> >> On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai wrote: >> >>> The exception looks like the one mentioned in >>> https://is

Re: Parquet error reading data that contains array of structs

2015-04-26 Thread Jianshi Huang
> Fix Version >> >> On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai wrote: >> >>> The exception looks like the one mentioned in >>> https://issues.apache.org/jira/browse/SPARK-4520. What is the version >>> of Spark?

Parquet error reading data that contains array of structs

2015-04-24 Thread Jianshi Huang
at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: String literal in dataframe.select(...)

2015-04-22 Thread Jianshi Huang
Oh, I found it out. Need to import sql.functions._ Then I can do table.select(lit("2015-04-22").as("date")) Jianshi On Wed, Apr 22, 2015 at 7:27 PM, Jianshi Huang wrote: > Hi, > > I want to do this in Spark SQL DSL: > > select '2015-04-22'
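For reference, a self-contained sketch of that solution, assuming the Spark 1.3-era DataFrame API (the table name and data below are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._   // brings lit() into scope

    val sc = new SparkContext(new SparkConf().setAppName("lit-example").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical input DataFrame.
    val table = sc.parallelize(Seq(1, 2, 3)).toDF("id")

    // Add a constant string column, as described in the reply above.
    table.select(table("id"), lit("2015-04-22").as("date")).show()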

String literal in dataframe.select(...)

2015-04-22 Thread Jianshi Huang
Hi, I want to do this in Spark SQL DSL: select '2015-04-22' as date from table How to do this? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

How to write Hive's map(key, value, ...) in Spark SQL DSL

2015-04-22 Thread Jianshi Huang
Hi, I want to write this in Spark SQL DSL: select map('c1', c1, 'c2', c2) as m from table Is there a way? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
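The question targets the DSL, but for comparison, here is a sketch that reaches Hive's map() through plain SQL on a HiveContext (table and column names are made up; this is not the DSL answer being asked for):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-map-sketch").setMaster("local[*]"))
    val hive = new HiveContext(sc)
    import hive.implicits._

    sc.parallelize(Seq(("a", "b"))).toDF("c1", "c2").registerTempTable("t")

    // map(key, value, ...) is a Hive built-in, so it resolves through SQL on a HiveContext.
    hive.sql("select map('c1', c1, 'c2', c2) as m from t").show()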

Re: How to do dispatching in Streaming?

2015-04-17 Thread Jianshi Huang
m lime / the big picture – in some models, > friction can be a huge factor in the equations in some other it is just > part of the landscape > > > > *From:* Gerard Maas [mailto:gerard.m...@gmail.com] > *Sent:* Friday, April 17, 2015 10:12 AM > > *To:* Evo Eftimov > *Cc:* Tath

How to do dispatching in Streaming?

2015-04-12 Thread Jianshi Huang
one DStream -> multiple DStreams) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Add partition support in saveAsParquet

2015-03-26 Thread Jianshi Huang
Hi, Anyone has similar request? https://issues.apache.org/jira/browse/SPARK-6561 When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: def saveAsParquet(path: String, partitionColumns: Seq[String]) -- Jianshi Huang LinkedIn
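The two-argument saveAsParquet signature above is the API proposed in SPARK-6561. For reference, later Spark releases (1.4+) expose equivalent partitioned Parquet output through DataFrameWriter; a minimal sketch assuming Spark 1.4+ APIs and made-up data:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("partitioned-parquet").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("2015-03-26", "a", 1), ("2015-03-27", "b", 2)))
      .toDF("date", "key", "value")

    df.write
      .partitionBy("date")                 // partition column, analogous to the proposed argument
      .parquet("/tmp/partitioned_events")  // hypothetical output path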

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Oh, by default it's set to 0L. I'll try setting it to 3 immediately. Thanks for the help! Jianshi On Mon, Mar 16, 2015 at 11:32 PM, Jianshi Huang wrote: > Thanks Shixiong! > > Very strange that our tasks were retried on the same executor again and

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
of our cases are the second one, we set > "spark.scheduler.executorTaskBlacklistTime" to 3 to solve such "No > space left on device" errors. So if a task runs unsuccessfully in some > executor, it won't be scheduled to the same executor in 30 seconds. > > > Best Regards, > Shi
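A sketch of the setting described in that reply; the 30000 ms value is an assumption matching the "30 seconds" mentioned there (the exact figure is truncated in the archive):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("blacklist-sketch")
      .set("spark.scheduler.executorTaskBlacklistTime", "30000")   // milliseconds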

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353 On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang wrote: > Hi, > > We're facing "No space left on device" errors lately from time to time. > The job will fail after retries. Obvious in such case, retry w

Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
the problematic datanode before retrying it. And maybe dynamically allocate another datanode if dynamic allocation is enabled. I think there needs to be a class of fatal errors that can't be recovered with retries. And it's best if Spark can handle it nicely. Thanks, -- Jianshi Huang LinkedIn:

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Liancheng also found out that the Spark jars are not included in the classpath of URLClassLoader. Hmm... we're very close to the truth now. Jianshi On Fri, Mar 13, 2015 at 6:03 PM, Jianshi Huang wrote: > I'm almost certain the problem is the ClassLoader. > > So adding

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
I'm almost certain the problem is the ClassLoader. So adding fork := true solves problems for test and run. The problem is how can I fork a JVM for sbt console? fork in console := true seems not working... Jianshi On Fri, Mar 13, 2015 at 4:35 PM, Jianshi Huang wrote: > I gues

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
nction is throwing exception >>> >>> Exception in thread "main" scala.ScalaReflectionException: class >>> org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial >>> classloader with boot classpath [.] not found >>> >>

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
Forget about my last message. I was confused. Spark 1.2.1 + Scala 2.10.4 started by SBT console command also failed with this error. However running from a standard spark shell works. Jianshi On Fri, Mar 13, 2015 at 2:46 PM, Jianshi Huang wrote: > Hmm... look like the console command st

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
Hmm... look like the console command still starts a Spark 1.3.0 with Scala 2.11.6 even I changed them in build.sbt. So the test with 1.2.1 is not valid. Jianshi On Fri, Mar 13, 2015 at 2:34 PM, Jianshi Huang wrote: > I've confirmed it only failed in console started by SBT. > >

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
@transient val sqlc = new org.apache.spark.sql.SQLContext(sc) [info] implicit def sqlContext = sqlc [info] import sqlc._ Jianshi On Fri, Mar 13, 2015 at 3:10 AM, Jianshi Huang wrote: > BTW, I was running tests from SBT when I get the errors. One test turn a > Seq of case class to Data

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
:23 AM, Jianshi Huang wrote: > Same issue here. But the classloader in my exception is somehow different. > > scala.ScalaReflectionException: class > org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with > java.net.URLClassLoader@53298398 of type class java.net.URLCla

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
th boot classpath [.] not found >>> >>> >>> Here's more info on the versions I am using - >>> >>> 2.11 >>> 1.2.1 >>> 2.11.5 >>> >>> Please let me know how can I resolve this problem. >>> >>> Thanks >>> Ashish >>> >> >> > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
user home > directories either. Typically, like in YARN, you would have a number of > directories (on different disks) mounted and configured for local > storage for jobs. > > On Wed, Mar 11, 2015 at 7:42 AM, Jianshi Huang > wrote: > > Unfortunately /tmp mount is really small in ou

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
n't support expressions or wildcards in that configuration. For > each application, the local directories need to be constant. If you > have users submitting different Spark applications, those can each set > spark.local.dirs. > > - Patrick > > On Wed, Mar 11, 2015 at 12:14 AM, J

How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Hi, I need to set per-user spark.local.dir, how can I do that? I tried both /x/home/${user.name}/spark/tmp and /x/home/${USER}/spark/tmp And neither worked. Looks like it has to be a constant setting in spark-defaults.conf. Right? Any ideas how to do that? Thanks, -- Jianshi Huang
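A minimal sketch of the per-application workaround suggested in the replies, setting the directory from application code where ${user.name} can be resolved (the path pattern comes from the question; note that on YARN the node manager's configured local dirs normally take precedence):

    val user = sys.props.getOrElse("user.name", "unknown")
    val conf = new org.apache.spark.SparkConf()
      .setAppName("per-user-local-dir")
      .set("spark.local.dir", s"/x/home/$user/spark/tmp")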

Re: Having lots of FetchFailedException in join

2015-03-05 Thread Jianshi Huang
ar 5, 2015 at 4:01 PM, Shao, Saisai wrote: > I think there’s a lot of JIRA trying to solve this problem ( > https://issues.apache.org/jira/browse/SPARK-5763). Basically sort merge > join is a good choice. > > > > Thanks > > Jerry > > > > *From:* Jianshi Hua

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
48 PM, Jianshi Huang wrote: > I see. I'm using core's join. The data might have some skewness > (checking). > > I understand shuffle can spill data to disk but when consuming it, say in > cogroup or groupByKey, it still needs to read the whole group elements, > right? I gues

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
park core side, all the shuffle related operations can spill the > data into disk and no need to read the whole partition into memory. But if > you uses SparkSQL, it depends on how SparkSQL uses this operators. > > > > CC @hao if he has some thoughts on it. > > > > Than

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
e issues when join key is skewed or key number is > smaller, so you will meet OOM. > > > > Maybe you could monitor each stage or task’s shuffle and GC status also > system status to identify the problem. > > > > Thanks > > Jerry > > > > *From:* Jianshi

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
One really interesting is that when I'm using the netty-based spark.shuffle.blockTransferService, there's no OOM error messages (java.lang.OutOfMemoryError: Java heap space). Any idea why it's not here? I'm using Spark 1.2.1. Jianshi On Thu, Mar 5, 2015 at 1:56 PM, Jiansh
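For reference, the transfer service being compared here can be chosen explicitly (a Spark 1.2-era setting; the value below is just an illustration):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.blockTransferService", "netty")   // or "nio"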

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
at 2:11 PM, Jianshi Huang wrote: > Hmm... ok, previous errors are still block fetch errors. > > 15/03/03 10:22:40 ERROR RetryingBlockFetcher: Exception while beginning > fetch of 11 outstanding blocks > java.io.IOException: Failed to connect to host-xxx

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
Davidson wrote: > Drat! That doesn't help. Could you scan from the top to see if there were > any fatal errors preceding these? Sometimes a OOM will cause this type of > issue further down. > > On Tue, Mar 3, 2015 at 8:16 PM, Jianshi Huang > wrote: > >> The failed

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
its logs as well. > > On Tue, Mar 3, 2015 at 11:03 AM, Jianshi Huang > wrote: > >> Sorry that I forgot the subject. >> >> And in the driver, I got many FetchFailedException. The error messages are >> >> 15/03/03 10:34:32 WARN TaskSetManager: Lost task 31.0 in

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) Jianshi On Wed, Mar 4, 2015 at 2:55 AM, Jianshi Huang wrote: > Hi, > > I got this error message: > &

[no subject]

2015-03-03 Thread Jianshi Huang
SNAPSHOT I built around Dec. 20. Is there any bug fixes related to shuffle block fetching or index files after that? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Loading tables using parquetFile vs. loading tables from Hive metastore with Parquet serde

2015-02-15 Thread Jianshi Huang
serde? Loading tables using parquetFile vs. loading tables from Hive metastore with Parquet serde Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Dynamic partition pattern support

2015-02-15 Thread Jianshi Huang
: https://issues.apache.org/jira/browse/SPARK-5828 Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-13 Thread Jianshi Huang
Reynold Xin: > > I think we made the binary protocol compatible across all versions, so you >> should be fine with using any one of them. 1.2.1 is probably the best since >> it is the most recent stable release. >> >> On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang >

Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-10 Thread Jianshi Huang
, 1.3.0) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Pig loader in Spark

2015-02-03 Thread Jianshi Huang
Hi, Anyone has implemented the default Pig Loader in Spark? (loading delimited text files with .pig_schema) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Hive UDAF percentile_approx says "This UDAF does not support the deprecated getEvaluator() method."

2015-01-13 Thread Jianshi Huang
>> val parameterInfo = new >> SimpleGenericUDAFParameterInfo(inspectors.toArray, false, false) >> resolver.getEvaluator(parameterInfo) >> >> FYI >> >> On Tue, Jan 13, 2015 at 1:51 PM, Jianshi Huang >> wrote: >>> Hi, >>

Hive UDAF percentile_approx says "This UDAF does not support the deprecated getEvaluator() method."

2015-01-13 Thread Jianshi Huang
org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:143) I'm using latest branch-1.2 I found in PR that percentile and percentile_approx are supported. A bug? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-23 Thread Jianshi Huang
FYI, Latest hive 0.14/parquet will have column renaming support. Jianshi On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust wrote: > You might also try out the recently added support for views. > > On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang > wrote: > >> Ah... I see. T

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Jianshi Huang
> > > > On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang > wrote: > >> Ok, found another possible bug in Hive. >> >> My current solution is to use ALTER TABLE CHANGE to rename the column >> names. >> >> The problem is after renaming the colum

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-08 Thread Jianshi Huang
right? We can extract > some useful functions from JsonRDD.scala, so others can access them. > > Thanks, > > Yin > > On Mon, Dec 8, 2014 at 1:29 AM, Jianshi Huang > wrote: > >> I checked the source code for inferSchema. Looks like this is exactly >> what I want:

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-07 Thread Jianshi Huang
I checked the source code for inferSchema. Looks like this is exactly what I want: val allKeys = rdd.map(allKeysWithValueTypes).reduce(_ ++ _) Then I can do createSchema(allKeys). Jianshi On Sun, Dec 7, 2014 at 2:50 PM, Jianshi Huang wrote: > Hmm.. > > I've created

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-06 Thread Jianshi Huang
Hmm.. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782 Jianshi On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang wrote: > Hi, > > What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? > > I'm currently converting ea

Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-06 Thread Jianshi Huang
Hi, What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? I'm currently converting each Map to a JSON String and do JsonRDD.inferSchema. How about adding inferSchema support to Map[String, Any] directly? It would be very useful. Thanks, -- Jianshi Huang LinkedI
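A minimal sketch of the JSON workaround described above (Spark 1.2-era jsonRDD API assumed; the hand-rolled serializer below only handles flat maps of strings and numbers):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("map-to-schemardd").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    val maps = sc.parallelize(Seq(
      Map[String, Any]("name" -> "a", "count" -> 1),
      Map[String, Any]("name" -> "b", "count" -> 2)))

    // Serialize each map to a JSON string (quoting strings, leaving numbers bare) ...
    val json = maps.map { m =>
      m.map { case (k, v) =>
        val value = v match {
          case s: String => "\"" + s + "\""
          case other     => other.toString
        }
        "\"" + k + "\": " + value
      }.mkString("{", ", ", "}")
    }

    // ... and let Spark SQL infer the schema from the JSON strings.
    val schemaRdd = sqlContext.jsonRDD(json)
    schemaRdd.printSchema()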

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
scala> sql("select cre_ts from pmt limit 1").collect res16: Array[org.apache.spark.sql.Row] = Array([null]) I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-4781 Jianshi On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang wrote: > Hmm... another issue I found

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Hmm... another issue I found doing this approach is that ANALYZE TABLE ... COMPUTE STATISTICS will fail to attach the metadata to the table, and later broadcast join and such will fail... Any idea how to fix this issue? Jianshi On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang wrote: > V

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Very interesting, the line doing drop table will throws an exception. After removing it all works. Jianshi On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang wrote: > Here's the solution I got after talking with Liancheng: > > 1) using backquote `..` to wrap up all illegal characte

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
to drop and register table val t = table(name) val newSchema = StructType(t.schema.fields.map(s => s.copy(name = s.name.replaceAll(".*?::", "")))) sql(s"drop table $name") applySchema(t, newSchema).registerTempTable(name) I'm testing it for now. Tha

Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
create external table pmt ( sorted::id bigint ) stored as parquet location '...' Obviously it didn't work; I also tried removing the identifier sorted::, but the resulting rows contain only nulls. Any idea how to create a table in HiveContext from these Parquet files? Thanks,

Re: drop table if exists throws exception

2014-12-05 Thread Jianshi Huang
exception in the logs, but that exception does not propagate to user code. >> >> On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang >> wrote: >> >> > Hi, >> > >> > I got exception saying Hive: NoSuchObjectException(message: table >> > not found)

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
With Liancheng's suggestion, I've tried setting spark.sql.hive.convertMetastoreParquet false but still analyze noscan return -1 in rawDataSize Jianshi On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang wrote: > If I run ANALYZE without NOSCAN, then Hive can successfully

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
30 PM, Jianshi Huang wrote: > Sorry for the late of follow-up. > > I used Hao's DESC EXTENDED command and found some clue: > > new (broadcast broken Spark build): > parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, > COLUMN_STATS_ACCURATE

drop table if exists throws exception

2014-12-04 Thread Jianshi Huang
Hi, I got exception saying Hive: NoSuchObjectException(message: table not found) when running "DROP TABLE IF EXISTS " Looks like a new regression in Hive module. Anyone can confirm this? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
This will print the detailed physical plan. > > Let me know if you still have problems. > > > > Hao > > > > *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] > *Sent:* Thursday, November 27, 2014 10:24 PM > *To:* Cheng, Hao > *Cc:* user > *Subject:* Re: Auto B

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
I created a ticket for this: https://issues.apache.org/jira/browse/SPARK-4757 Jianshi On Fri, Dec 5, 2014 at 1:31 PM, Jianshi Huang wrote: > Correction: > > According to Liancheng, this hotfix might be the root cause: > > > https://github.com/a

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Correction: According to Liancheng, this hotfix might be the root cause: https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce Jianshi On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang wrote: > Looks like the datanucleus*.jar shouldn't appear in the hdfs

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Looks like the datanucleus*.jar shouldn't appear in the hdfs path in Yarn-client mode. Maybe this patch broke yarn-client. https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53 Jianshi On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang wrote: > Act

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Actually my HADOOP_CLASSPATH has already been set to include /etc/hadoop/conf/* export HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase classpath) Jianshi On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang wrote: > Looks like somehow Spark failed

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
SPATH? Jianshi On Fri, Dec 5, 2014 at 11:37 AM, Jianshi Huang wrote: > I got the following error during Spark startup (Yarn-client mode): > > 14/12/04 19:33:58 INFO Client: Uploading resource > file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar > -&g

Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
ter HEAD yesterday. Is this a bug? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Jianshi Huang
https://github.com/apache/spark/pull/3270 should be > another optimization for this. > > > > > > *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] > *Sent:* Wednesday, November 26, 2014 4:36 PM > *To:* user > *Subject:* Auto BroadcastJoin optimization failed in lates

Auto BroadcastJoin optimization failed in latest Spark

2014-11-26 Thread Jianshi Huang
se has met similar situation? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
/usr/lib/hive/lib doesn’t show any of the parquet jars, but ls /usr/lib/impala/lib shows the jar we’re looking for as parquet-hive-1.0.jar Is it removed from latest Spark? Jianshi On Wed, Nov 26, 2014 at 2:13 PM, Jianshi Huang wrote: > Hi, > > Looks like the latest SparkSQL with Hive 0

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) Using the same DDL and Analyze script above. Jianshi On Sat, Oct 11, 2014 at 2:18 PM, Jianshi Huang wrote: > It works fine, thanks for the help Michael. > > Liancheng also told m

Re: How to deal with BigInt in my case class for RDD => SchemaRDD convertion

2014-11-21 Thread Jianshi Huang
> Hello Jianshi, > > The reason of that error is that we do not have a Spark SQL data type for > Scala BigInt. You can use Decimal for your case. > > Thanks, > > Yin > > On Fri, Nov 21, 2014 at 5:11 AM, Jianshi Huang > wrote: >> Hi, >>
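A sketch of the suggested workaround, carrying the value as BigDecimal (which Spark SQL maps to its Decimal type) instead of scala.BigInt before building the case-class RDD (field names are made up):

    case class Txn(id: Long, amount: BigDecimal)

    // e.g. converting an existing BigInt field before registering the RDD as a table:
    val big: BigInt = BigInt("12345678901234567890")
    val txn = Txn(1L, BigDecimal(big))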

How to deal with BigInt in my case class for RDD => SchemaRDD convertion

2014-11-21 Thread Jianshi Huang
Hi, I got an error during rdd.registerTempTable(...) saying scala.MatchError: scala.BigInt Looks like BigInt cannot be used in SchemaRDD, is that correct? So what would you recommend to deal with it? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog:

NotSerializableException caused by Implicit sparkContext or sparkStreamingContext, why?

2014-11-18 Thread Jianshi Huang
exception saying that SparkContext is not serializable, which is totally irrelevant to txnSentTo. I heard in Scala 2.11, there will be much better support in REPL to solve this issue. Is that true? Could anyone explain why we're having this problem? Thanks, -- Jianshi Huang LinkedIn: jians

Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-18 Thread Jianshi Huang
ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > [error] (streaming-kafka/*:update) sbt.ResolveException: unresolved > dependency: org.apache.kafka#kafka_2.11;0.8.0: not found > [error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: > org.scalama

Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Jianshi Huang
Any notable issues for using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my spark application and use Spark dist build with 2.10 ? I'm looking forward to migrate to 2.11 for some quasiquote features. Couldn't make it run in 2.10... Cheers, -- Jianshi Huang LinkedI

Re: Is there setup and cleanup function in spark?

2014-11-17 Thread Jianshi Huang
14, 2014 at 2:49 PM, Jianshi Huang > wrote: > >> Ok, then we need another trick. >> >> let's have an *implicit lazy var connection/context* around our code. >> And setup() will trigger the eval and initialization. >> > > Due to lazy evaluation, I thin

Compiling Spark master HEAD failed.

2014-11-14 Thread Jianshi Huang
wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. -> [Help 1] Anyone know what's the problem? I'm building it on OSX. I didn't have this problem one month ago. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshu

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
> If you’re just relying on the side effect of setup() and cleanup() then > I think this trick is OK and pretty cleaner. > > But if setup() returns, say, a DB connection, then the map(...) part and > cleanup() can’t get the connection object. > > On 11/14/14 1:20 PM, Jianshi Huang w
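For reference, a common sketch of the per-partition setup/cleanup pattern this thread is circling around (the Conn class is a hypothetical stand-in for a real connection or client handle):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical resource standing in for a DB/client handle.
    class Conn { def lookup(x: Int): Int = x * 2; def close(): Unit = () }

    val sc = new SparkContext(new SparkConf().setAppName("setup-cleanup").setMaster("local[*]"))

    val result = sc.parallelize(1 to 10).mapPartitions { iter =>
      val conn = new Conn()                       // setup(), once per partition
      val mapped = iter.map(conn.lookup).toList   // consume the iterator before closing
      conn.close()                                // cleanup()
      mapped.iterator
    }
    result.collect()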

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
w-to-translate-from-mapreduce-to-apache-spark/ > > On 11/14/14 10:44 AM, Dai, Kevin wrote: > > HI, all > > > > Is there setup and cleanup function as in hadoop mapreduce in spark which > does some initialization and cleanup work? > > > > Best Regards, >

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
n and cleanup work? > > > > Best Regards, > > Kevin. > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: RDD to DStream

2014-11-12 Thread Jianshi Huang
needs to be collect to driver, is there a way to avoid doing this? Thanks Jianshi On Mon, Oct 27, 2014 at 4:57 PM, Jianshi Huang wrote: > Sure, let's still focus on the streaming simulation use case. It's a very > useful problem to solve. > > If we're going to use th

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-11-10 Thread Jianshi Huang
n of spray + akka + spark are you > using? > > [error] org.scalamacros:quasiquotes _2.10, _2.10.3 > [trace] Stack trace suppressed: run last *:update for the full output. > [error] (*:update) Conflicting cross-version suffixes in: > org.scalamacros:quasiquotes > >

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-30 Thread Jianshi Huang
> Can you try a Spray version built with 2.2.x along with Spark 1.1 and > include the Akka dependencies in your project’s sbt file? > > > > Mohammed > > > > *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] > *Sent:* Tuesday, October 28, 2014 8:58 PM >

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
I'm using Spark built from HEAD, I think it uses modified Akka 2.3.4, right? Jianshi On Wed, Oct 29, 2014 at 5:53 AM, Mohammed Guller wrote: > Try a version built with Akka 2.2.x > > > > Mohammed > > > > *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.co

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
org.spark-project.akka 2.3.4-spark it should solve problem. Makes sense? I'll give it a shot when I have time, now probably I'll just not using Spray client... Cheers, Jianshi On Tue, Oct 28, 2014 at 6:02 PM, Jianshi Huang wrote: > Hi, > > I got the following exc

Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
the exception. Anyone has any idea what went wrong? Need help! -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Jianshi Huang
I have never tried this yet, but maybe you can use an in-memory Derby > database as metastore > https://db.apache.org/derby/docs/10.7/devguide/cdevdvlpinmemdb.html > > I'll investigate this when free, guess we can use this for Spark SQL Hive > support testing. > > On 10/27/14 4

Re: Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-27 Thread Jianshi Huang
Any suggestion? :) Jianshi On Thu, Oct 23, 2014 at 3:49 PM, Jianshi Huang wrote: > The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s > per topic). > > Which configuration do you recommend? > - 1 Spark app consuming all Kafka topics > - 10 separ

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
special DStream. Jianshi On Mon, Oct 27, 2014 at 4:44 PM, Shao, Saisai wrote: > Yes, I understand what you want, but maybe hard to achieve without > collecting back to driver node. > > > > Besides, can we just think of another way to do it. > > > > Thanks >
