Spark Sparser library

2018-08-09 Thread umargeek
Hi Team,

Could you please let me know which Sparser library (jar or package) to
include when submitting a Spark application so that the data source format
below can be used:

val df = spark.read.format("edu.stanford.sparser.json")

When I use the above format, the application throws a ClassNotFoundException.
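
For reference, a ClassNotFoundException for a data source usually means the
jar implementing it is not on the classpath. A minimal sketch of passing it
explicitly, assuming an assembly jar built from the
stanford-futuredata/sparser repository (the jar path and load path are
placeholders, and the format string must match whatever data source name
that jar actually registers):

spark-shell --jars /path/to/sparser-spark-assembly.jar

val df = spark.read.format("edu.stanford.sparser.json").load("/data/input.json")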

Thanks,
Umar




Re: How to validate orc vectorization is working within spark application?

2018-07-12 Thread umargeek
Hello Jorn,

I am unable to post the entire code due to data-sharing restrictions.

Use case: I read data from an HDFS file every minute and perform
aggregations on it. I would like to understand how to run these aggregations
with vectorization enabled, and what the prerequisites are to enable it
successfully.

Thanks,
Umar



Pyspark Structured Streaming Error

2018-07-12 Thread umargeek
Hi All,

I am trying to test Structured Streaming using PySpark. Below are the launch
command and the packages used:

pyspark2 --master=yarn --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 --packages org.apache.kafka:kafka-clients:0.10.0.1

but I am getting the following error:



Using Python version 3.6.5 (default, Apr 10 2018 17:08:37)
SparkSession available as 'spark'.
>>> df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "pds1:9092,pds2:9092,pds3:9092").option("subscribe", "ingest").load()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/pyspark/sql/streaming.py", line 403, in load
    return self._df(self._jreader.load())
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o64.load.
: java.lang.ClassNotFoundException: Failed to find data source: kafka.
Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
    at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:159)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
    ... 12 more
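
For reference, the likely cause here: when --packages is specified twice,
the second flag overrides the first, so the spark-sql-kafka package never
reaches the classpath and the "kafka" source cannot be resolved. A corrected
launch (a sketch, assuming the same versions) passes both coordinates in a
single comma-separated --packages list:

pyspark2 --master=yarn --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0,org.apache.kafka:kafka-clients:0.10.0.1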



Spark DF to Hive table with both Partition and Bucketing not working

2018-06-19 Thread umargeek
Hi Folks,

I am reading a Spark DataFrame from an ORC file, adding two new columns, and
then trying to save it to a Hive table with both the partitioning and
bucketing features.

I am using Spark 2.3 (both partitioning and bucketing are available in this
version).

Looking for advice.

Code Snippet:

df_orc_data = spark.read.format("orc").option("delimiter", "|").option("header", "true").option("inferschema", "true").load(filtered_path)
df_fil_ts_data = df_orc_data.withColumn("START_TS", lit(process_time).cast("timestamp"))
daily = datetime.datetime.utcnow().strftime('%Y-%m-%d')
df_filtered_data = df_fil_ts_data.withColumn("DAYPART", lit(daily).cast("string"))
hour = datetime.datetime.utcnow().strftime('%H')
df_filtered = df_filtered_data.withColumn("HRS", lit(hour).cast("string"))
(df_filtered.write.partitionBy("DAYPART").bucketBy(24, "HRS").sortBy("HRS").mode("append").orc('/user/umar/netflow_filtered').saveAsTable("default.DDOS_NETFLOW_FILTERED"))

Error:
"'save' does not support bucketing right now;"



Thanks,
Umar



Re: How to validate orc vectorization is working within spark application?

2018-06-19 Thread umargeek
Hi Folks,

I would appreciate a few pointers on the above query w.r.t. vectorization;
looking forward to support from the community.

Thanks,
Umar



Re: testing frameworks

2018-05-22 Thread umargeek
Hi Steve,

You can try the pytest-spark plugin if you are writing your programs with
PySpark; please find the reference link below.

https://github.com/malexer/pytest-spark
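
A minimal sketch of what a test looks like with the plugin's spark_session
fixture (the test body itself is a hypothetical example):

def test_filter_count(spark_session):
    # spark_session is injected by the pytest-spark plugin
    df = spark_session.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["k", "v"])
    assert df.filter(df.k == "a").count() == 2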
  

Thanks,
Umar



Alternative to numpy in Spark MLlib

2018-05-22 Thread umargeek
Hi Folks,

I am planning to rewrite one of my Python modules, which computes entropy
using numpy, on top of Spark MLlib so that the work can be distributed.

Can you please advise on the feasibility of this approach, or suggest any
alternatives?
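
For what it's worth, entropy may not need MLlib at all: plain DataFrame
aggregations already distribute the computation. A minimal sketch, assuming
a DataFrame df with a categorical column named "value" (both names are
hypothetical):

from pyspark.sql import functions as F

total = df.count()
# relative frequency of each distinct value
probs = df.groupBy("value").count().withColumn("p", F.col("count") / total)
# Shannon entropy in bits: -sum(p * log2(p))
entropy = probs.agg((-F.sum(F.col("p") * F.log2("p"))).alias("h")).first()["h"]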

Thanks,
Umar



How to validate orc vectorization is working within spark application?

2018-05-22 Thread umargeek
Hi Folks,

I have enabled the configurations listed below within my Spark Streaming
application, but I did not see any performance benefit after setting these
parameters. Can you please help me validate whether vectorization is working
as expected / enabled correctly?

Note: I am using Spark 2.3 and have converted all the data within my
application to ORC format.

sparkSqlCtx.setConf("spark.sql.orc.filterPushdown", "true")
sparkSqlCtx.setConf("spark.sql.orc.enabled", "true")
sparkSqlCtx.setConf("spark.sql.hive.convertMetastoreOrc", "true")
sparkSqlCtx.setConf("spark.sql.orc.char.enabled", "true")
sparkSqlCtx.setConf("spark.sql.orc.impl","native")
sparkSqlCtx.setConf("spark.sql.orc.enableVectorizedReader","true")

Thanks,
Umar



Streaming Analytics/BI tool to connect Spark SQL

2017-12-07 Thread umargeek
Hi All,

We are currently looking at real-time streaming analytics over data stored
as Spark SQL tables. Is there any external connectivity available to connect
them to BI tools (Pentaho/Arcadia)?

Currently we store the data in Hive tables, but its response time on the
Arcadia dashboard is slow.

Looking for suggestions: should we move away from Hive, use some form of
direct connectivity to Spark SQL, or move to Ignite?
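
One option worth considering (a sketch, not a confirmed setup from this
thread) is to expose the Spark SQL tables over JDBC/ODBC via the Spark
Thrift Server, which most BI tools can connect to; the host and port below
are defaults/assumptions:

$SPARK_HOME/sbin/start-thriftserver.sh --master yarn
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000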

Thanks,
Umar



Re: How to write dataframe to kafka topic in spark streaming application using pyspark other than collect?

2017-12-07 Thread umargeek
Hi Team,

Can someone please advise on the above post? Because of this limitation I
have been writing the data out as a file to an HDFS location, and as of now
I only pass the file name through the Kafka topic, which does not use Kafka
to its full potential. Looking forward to suggestions.

Thanks,
Umar



Suggestions on using scala/python for Spark Streaming

2017-10-26 Thread umargeek
We are building a Spark Streaming application that is processing- and
time-intensive. We are currently using the Python API, but we are looking
for suggestions on whether to use Scala over Python (pros and cons), as we
are planning the production setup as the next step.

Thanks,
Umar 



How to write dataframe to kafka topic in spark streaming application using pyspark?

2017-09-25 Thread umargeek
Can anyone provide a code snippet or the steps to write a DataFrame to a
Kafka topic in a Spark Streaming application using PySpark with Spark 2.1.1
and Kafka 0.8 (direct stream approach)?
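
Since the Kafka 0.8 integration offers no built-in DataFrame sink, one
workable pattern (a sketch, assuming the kafka-python package; the broker
address and topic name are placeholders) is to publish from each partition
via foreachPartition, which avoids collect() on the driver:

from kafka import KafkaProducer

def publish_partition(rows):
    # one producer per partition, not per record
    producer = KafkaProducer(bootstrap_servers="broker1:9092")
    for row in rows:
        producer.send("my_topic", row.encode("utf-8"))
    producer.flush()

df.toJSON().foreachPartition(publish_partition)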

Thanks,
Umar


