Re: Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-02-13 Thread Abhishek Singla
Hi Team, Could someone provide some insights into this issue? Regards, Abhishek Singla On Wed, Jan 17, 2024 at 11:45 PM Abhishek Singla < abhisheksingla...@gmail.com> wrote: > Hi Team, > > Version: 3.2.2 > Java Version: 1.8.0_211 > Scala Version: 2.12.15 > Cluster: S

Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-01-17 Thread Abhishek Singla
ionId, appConfig)) .option("checkpointLocation", appConfig.getChk().getPath()) .start() .awaitTermination(); Regards, Abhishek Singla

Re: config: minOffsetsPerTrigger not working

2023-04-27 Thread Abhishek Singla

config: minOffsetsPerTrigger not working

2023-04-27 Thread Abhishek Singla
t:7077", "spark.app.name": "app", "spark.sql.streaming.kafka.useDeprecatedOffsetFetching": false, "spark.sql.streaming.metricsEnabled": true } But these configs do not seem to be working as I can see Spark processing batches of 3k-15k immediately one after another. Is there something I am missing? Ref: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html Regards, Abhishek Singla
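
Editor's note: minOffsetsPerTrigger (together with maxTriggerDelay) is documented as an option on the Kafka source itself rather than a spark.* session configuration, so a minimal sketch of setting it on the reader is shown below in case it was being passed at the session level. Broker, topic, and checkpoint path are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class MinOffsetsPerTriggerSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("min-offsets-sketch").getOrCreate();

    // minOffsetsPerTrigger/maxTriggerDelay are Kafka source options (Spark 3.2+),
    // set on the reader itself, not as spark.* configuration entries.
    Dataset<Row> stream = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
        .option("subscribe", "events")                      // placeholder topic
        .option("minOffsetsPerTrigger", "100000")           // wait for at least 100k offsets...
        .option("maxTriggerDelay", "15m")                   // ...but fire anyway after 15 minutes
        .option("maxOffsetsPerTrigger", "500000")           // upper bound per batch
        .load();

    stream.writeStream()
        .format("console")
        .option("checkpointLocation", "/tmp/chk")           // placeholder checkpoint path
        .trigger(Trigger.ProcessingTime("1 minute"))
        .start()
        .awaitTermination();
  }
}
```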

RE: Regarding spark-3.2.0 decommission features.

2022-01-26 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Dongjoon Hyun, Any inputs on the below issue would be helpful. Please let us know if we're missing anything. Thanks and Regards, Abhishek From: Patidar, Mohanlal (Nokia - IN/Bangalore) Sent: Thursday, January 20, 2022 11:58 AM To: user@spark.apache.org Subject: Suspected SPAM - RE

Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Abhishek Shakya
usage for RDD interfaces? PS: The question is posted on Stack Overflow as well: Link <https://stackoverflow.com/questions/69273205/does-apache-spark-3-support-gpu-usage-for-spark-rdds> Regards, - Abhishek Shakya Senior Data Scientist 1, Contact: +919002319890 | Em

[Spark Core] saveAsTextFile is unable to rename a directory using hadoop-azure NativeAzureFileSystem

2021-09-13 Thread Abhishek Jindal
aceImpl.java:434) [error] at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2788) [error] ... 5 more I am currently using spark-core-3.1.1.jar with hadoop-azure-3.2.2.jar but this same issue also occurs in hadoop-azure-3.3.1.jar as well. Please advise how I should solve this issue. Thanks, Abhishek

RE: Inclusive terminology usage in Spark

2021-06-30 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Sean, Thanks for the quick response. We’ll look into this. Thanks and Regards, Abhishek From: Sean Owen Sent: Wednesday, June 30, 2021 6:30 PM To: Rao, Abhishek (Nokia - IN/Bangalore) Cc: User Subject: Re: Inclusive terminology usage in Spark This was covered and mostly done last year

Inclusive terminology usage in Spark

2021-06-30 Thread Rao, Abhishek (Nokia - IN/Bangalore)
ticket to track this. https://issues.apache.org/jira/browse/SPARK-35952 Thanks and Regards, Abhishek

RE: Why is Spark 3.0.x faster than Spark 3.1.x

2021-05-17 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Maziyar, Mich Do we have any ticket to track this? Any idea if this is going to be fixed in 3.1.2? Thanks and Regards, Abhishek From: Mich Talebzadeh Sent: Friday, April 9, 2021 2:11 PM To: Maziyar Panahi Cc: User Subject: Re: Why is Spark 3.0.x faster than Spark 3.1.x Hi, Regarding

s3a staging committer (directory committer) not writing data to s3 bucket (final output directory) in spark3

2021-02-22 Thread Rao, Abhishek (Nokia - IN/Bangalore)
"op_is_file" : 2, "S3guard_metadatastore_throttle_rate99thPercentileFrequency (Hz)" : 0 }, "diagnostics" : { "fs.s3a.metadatastore.impl" : "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore", "fs.s3a.committer.magic.enabled

Trigger on GroupStateTimeout with no new data in group

2021-02-11 Thread Abhishek Gupta
Hi All, I had a question about modeling a user-session kind of analytics use case in Spark Structured Streaming. Is there a way to model something like this using arbitrary stateful Spark streaming? User session -> reads a few FAQs on a website and then decides whether or not to create a ticket. FAQ

RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-09-10 Thread Rao, Abhishek (Nokia - IN/Bangalore)
were seeing discrepancy in query execution time on S3 with Spark 3.0.0. Thanks and Regards, Abhishek From: Gourav Sengupta Sent: Wednesday, August 26, 2020 5:49 PM To: Rao, Abhishek (Nokia - IN/Bangalore) Cc: user Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-26 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Yeah… Not sure if I’m missing any configuration that is causing this issue. Any suggestions? Thanks and Regards, Abhishek From: Gourav Sengupta Sent: Wednesday, August 26, 2020 2:35 PM To: Rao, Abhishek (Nokia - IN/Bangalore) Cc: user@spark.apache.org Subject: Re: Spark 3.0 using S3 taking

RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-26 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Gourav, Yes. We’re using s3a. Thanks and Regards, Abhishek From: Gourav Sengupta Sent: Wednesday, August 26, 2020 1:18 PM To: Rao, Abhishek (Nokia - IN/Bangalore) Cc: user@spark.apache.org Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries Hi, are you using

RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-25 Thread Rao, Abhishek (Nokia - IN/Bangalore)
whereas in case of HDFS, it is only 4.5 GB. Any idea why this difference is there? Thanks and Regards, Abhishek From: Luca Canali Sent: Monday, August 24, 2020 7:18 PM To: Rao, Abhishek (Nokia - IN/Bangalore) Cc: user@spark.apache.org Subject: RE: Spark 3.0 using S3 taking long time for some set

RE: Spark Thrift Server in Kubernetes deployment

2020-06-22 Thread Rao, Abhishek (Nokia - IN/Bangalore)
, Abhishek From: Subash K Sent: Monday, June 22, 2020 9:00 AM To: user@spark.apache.org Subject: Spark Thrift Server in Kubernetes deployment Hi, We are currently using Spark 2.4.4 with Spark Thrift Server (STS) to expose a JDBC interface to the reporting tools to generate report from Spark tables. Now

RE: [External Sender] Spark Executor pod not getting created on kubernetes cluster

2019-10-07 Thread Rao, Abhishek (Nokia - IN/Bangalore)
with overlay network. Thanks and Regards, Abhishek From: manish gupta Sent: 01 October 2019 PM 09:20 To: Prudhvi Chennuru (CONT) Cc: user Subject: Re: [External Sender] Spark Executor pod not getting created on kubernetes cluster Kube-api server logs are not enabled. I will enable and check and get back

RE: web access to sparkUI on docker or k8s pods

2019-08-27 Thread Rao, Abhishek (Nokia - IN/Bangalore)
node) for now. There is option of using nodeport as well. That also works. Thanks and Regards, Abhishek From: Yaniv Harpaz Sent: Tuesday, August 27, 2019 7:34 PM To: user@spark.apache.org Subject: web access to sparkUI on docker or k8s pods hello guys, when I launch driver pods or even when I use

Re: New Spark Datasource for Hive ACID tables

2019-07-27 Thread Abhishek Somani
I realised that the build instructions in the README.md were not very clear due to some recent changes. I have updated those now. Thanks, Abhishek Somani On Sun, Jul 28, 2019 at 7:53 AM naresh Goud wrote: > Thanks Abhishek. > I will check it out. > > Thank you, > Naresh >

Re: New Spark Datasource for Hive ACID tables

2019-07-27 Thread Abhishek Somani
can just use it as: spark-shell --packages qubole:spark-acid:0.4.0-s_2.11 ...and it will be automatically fetched and used. Thanks, Abhishek On Sun, Jul 28, 2019 at 4:42 AM naresh Goud wrote: > It looks there is some internal dependency missing. > > libraryDependencies ++= Seq( > &qu

Re: New Spark Datasource for Hive ACID tables

2019-07-26 Thread Abhishek Somani
Hey Naresh, Thanks for your question. Yes it will work! Thanks, Abhishek Somani On Fri, Jul 26, 2019 at 7:08 PM naresh Goud wrote: > Thanks Abhishek. > > Will it work on hive acid table which is not compacted ? i.e table having > base and delta files? > > Let’s say hive a

New Spark Datasource for Hive ACID tables

2019-07-26 Thread Abhishek Somani
e tables via Spark as well. The datasource is also available as a spark package, and instructions on how to use it are available on the Github page <https://github.com/qubole/spark-acid>. We welcome your feedback and suggestions. Thanks, Abhishek Somani

RE: Spark on Kubernetes - log4j.properties not read

2019-06-10 Thread Rao, Abhishek (Nokia - IN/Bangalore)
in this case. You could try to build the container by placing the log4j.properties at some other location and set the same in spark.driver.extraJavaOptions Thanks and Regards, Abhishek From: Dave Jaffe Sent: Tuesday, June 11, 2019 6:45 AM To: user@spark.apache.org Subject: Spark on Kubernetes

Spark Metrics : Job Remains In "Running" State

2019-03-18 Thread Jain, Abhishek 3. (Nokia - IN/Bangalore)
Please let me know if it is the expected behavior. Regards, Abhishek Jain

RE: Spark streaming filling the disk with logs

2019-02-14 Thread Jain, Abhishek 3. (Nokia - IN/Bangalore)
for driver and executor/s) --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/log4j.properties" Regards, Abhishek Jain From: em...@yeikel.com Sent: Friday, February 15, 201

RE: Spark streaming filling the disk with logs

2019-02-14 Thread Jain, Abhishek 3. (Nokia - IN/Bangalore)
log4j.appender.rolling.maxBackupIndex=5 log4j.appender.rolling.file=/var/log/spark/ log4j.logger.org.apache.spark= This means log4j will roll the log file by 50MB and keep only 5 recent files. These files are saved in /var/log/spark directory, with filename mentioned. Regards, Abhishek Jain From
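
Editor's note: pulling together the settings quoted in this thread, a minimal log4j 1.x properties sketch for a size-based rolling appender might look like the following. The appender name, pattern, exact file name, and logger levels are assumptions; the file is then passed to the driver and executors via the -Dlog4j.configuration options shown above.

```properties
# Root logger routed to a rolling file appender (names and paths are placeholders)
log4j.rootLogger=INFO, rolling

log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/var/log/spark/spark.log

# Reduce framework chattiness, as suggested in the thread (levels are assumptions)
log4j.logger.org.apache.spark=WARN
log4j.logger.org.apache.parquet=ERROR
```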

RE: Spark streaming filling the disk with logs

2019-02-14 Thread Jain, Abhishek 3. (Nokia - IN/Bangalore)
=, log4j.logger.org.apache.parquet= etc.. These properties can be set in the conf/log4j .properties file. Hope this helps!  Regards, Abhishek Jain From: Deepak Sharma Sent: Thursday, February 14, 2019 12:10 PM To: spark users Subject: Spark streaming filling the disk with logs Hi All I am running a spark

RE: Spark UI History server on Kubernetes

2019-01-23 Thread Rao, Abhishek (Nokia - IN/Bangalore)
spark.eventLog.dir Thanks and Regards, Abhishek From: Battini Lakshman Sent: Wednesday, January 23, 2019 1:55 PM To: Rao, Abhishek (Nokia - IN/Bangalore) Subject: Re: Spark UI History server on Kubernetes HI Abhishek, Thank you for your response. Could you please let me know the properties you configured

RE: Spark UI History server on Kubernetes

2019-01-22 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi, We’ve set up the spark-history service (based on Spark 2.4) on K8S. The UI works perfectly fine when running on NodePort. We’re facing some issues when on ingress. Please let us know what kind of inputs you need. Thanks and Regards, Abhishek From: Battini Lakshman Sent: Tuesday, January 22

Re: [Spark Structured Streaming on K8S]: Debug - File handles/descriptor (unix pipe) leaking

2018-07-23 Thread Abhishek Tripathi
/6f838adf6651491bd4f263956f403c74 Thanks. Best Regards, *Abhishek Tripath* On Thu, Jul 19, 2018 at 10:02 AM Abhishek Tripathi wrote: > Hello All!​​ > I am using spark 2.3.1 on kubernetes to run a structured streaming spark > job which read stream from Kafka , perform some window aggregation and > output s

[Spark Structured Streaming on K8S]: Debug - File handles/descriptor (unix pipe) leaking

2018-07-19 Thread Abhishek Tripathi
1 (both topic has 20 partition and getting almost 5k records/s ) Hadoop version (Using hdfs for check pointing) : 2.7.2 Thank you for any help. Best Regards, *Abhishek Tripathi*

Ingesting data in parallel across workers in Data Frame

2017-01-20 Thread Abhishek Gupta
ata-sources/sql-databases.html>The problem I am facing is that I don't have a numeric column which can be used for achieving the partitioning. Any help would be appreciated. Thank You --Abhishek
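
Editor's note: one documented way to split a JDBC read without a numeric partition column is the predicate-based overload of DataFrameReader.jdbc, which creates one partition per predicate. A hedged sketch (Spark 2.x API; table name, column, ranges, URL, and credentials are placeholders):

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcPredicatePartitionSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("jdbc-predicates-sketch").getOrCreate();

    // Each predicate becomes the WHERE clause of one partition's query,
    // so no numeric partitionColumn/lowerBound/upperBound is needed.
    String[] predicates = {
        "created_date <  '2016-07-01'",
        "created_date >= '2016-07-01' AND created_date < '2017-01-01'",
        "created_date >= '2017-01-01'"
    };

    Properties props = new Properties();
    props.setProperty("user", "dbuser");        // placeholder credentials
    props.setProperty("password", "dbpass");

    Dataset<Row> df = spark.read()
        .jdbc("jdbc:postgresql://host:5432/db", "events", predicates, props);

    System.out.println(df.rdd().getNumPartitions()); // equals predicates.length
  }
}
```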

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-09 Thread Abhishek Bhandari
master as the slave, it's a single >> machine configuration). >> >> I really don't understand why this is happening since the same >> configuration but using a Spark 2.0.0 is running fine within Vagrant. >> Could someone please help? >> >> thanks in advance, >&g

Insert a JavaPairDStream into multiple cassandra table on the basis of key.

2016-11-03 Thread Abhishek Anand
Hi All, I have a JavaPairDStream. I want to insert this dstream into multiple cassandra tables on the basis of key. One approach is to filter each key and then insert it into cassandra table. But this would call filter operation "n" times depending on the number of keys. Is there any better

MapWithState with large state

2016-10-31 Thread Abhishek Singh
Can it handle state that is larger than what memory will hold?

Restful WS for Spark

2016-09-30 Thread ABHISHEK
. Thanks, Abhishek

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
I have tried with hdfs/tmp location but it didn't work. Same error. On 23 Sep 2016 19:37, "Aditya" <aditya.calangut...@augmentiq.co.in> wrote: > Hi Abhishek, > > Try below spark submit. > spark-submit --master yarn --deploy-mode cluster --files hdfs:// > a

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
, please help to correct it. Aditya: I have attached code here for reference. --File option will distributed reference file to all node but Kie session is not able to pickup it. Thanks, Abhishek On Fri, Sep 23, 2016 at 2:25 PM, Steve Loughran <ste...@hortonworks.com> wrote: > > On

Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
pl.java:58) ... 19 more -- Cheers, Abhishek

Re: Finding unique across all columns in dataset

2016-09-19 Thread Abhishek Anand
<sauravsinh...@gmail.com> wrote: > >> You can use distinct over you data frame or rdd >> >> rdd.distinct >> >> It will give you distinct across your row. >> >> On Mon, Sep 19, 2016 at 2:35 PM, Abhishek Anand <abhis.anan...@gmail.com>

Finding unique across all columns in dataset

2016-09-19 Thread Abhishek Anand
I have an RDD which contains 14 different columns. I need to find the distinct across all the columns of the RDD and write it to HDFS. How can I achieve this? Is there any distributed data structure that I can use and keep on updating as I traverse the new rows? Regards, Abhi

UNSUBSCRIBE

2016-08-09 Thread abhishek singh

Relative path in absolute URI

2016-08-03 Thread Abhishek Ranjan
up incorrect path. Did anyone encounter a similar problem with Spark 2.0? With Thanks, Abhishek

Re: Concatenate the columns in dataframe to create new collumns using Java

2016-07-18 Thread Abhishek Anand
.length]; > for (int i = 0; i < concatColumns.length; i++) { > concatColumns[i]=df.col(array[i]); > } > > return functions.concat(concatColumns).alias(fieldName); > } > > > > On Mon, Jul 18, 2016 at 2:14 PM, Abhishek Anand <abhis.anan...@g

Re: Concatenate the columns in dataframe to create new collumns using Java

2016-07-18 Thread Abhishek Anand
for (int i = 0; i < columns.length; i++) { > selectColumns[i]=df.col(columns[i]); > } > > > selectColumns[columns.length]=functions.concat(df.col("firstname"), > df.col("lastname")); > > df.select(selectColumns).sh

Concatenate the columns in dataframe to create new collumns using Java

2016-07-18 Thread Abhishek Anand
Hi, I have a dataframe, say having C0, C1, C2 and so on as columns. I need to create interaction variables to be taken as input for my program. For example, I need to create I1 as the concatenation of C0, C3, C5. Similarly, I2 = concat(C4, C5), and so on. How can I achieve this in my Java code for
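
Editor's note: a compact sketch of the approach shown in the replies above, using org.apache.spark.sql.functions.concat to build the interaction columns. It is written against the Spark 2.x Dataset<Row> type (on 1.x the equivalent is DataFrame); the input df and the column names are assumptions.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class InteractionColumnsSketch {
  // df is assumed to have string columns C0, C1, C2, ... as described above.
  static Dataset<Row> addInteractions(Dataset<Row> df) {
    return df
        .withColumn("I1", concat(col("C0"), col("C3"), col("C5")))
        .withColumn("I2", concat(col("C4"), col("C5")));
  }
}
```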

Change spark dataframe to LabeledPoint in Java

2016-06-30 Thread Abhishek Anand
Hi, I have a dataframe which I want to convert to labeled points. DataFrame labeleddf = model.transform(newdf).select("label","features"); How can I convert this to a LabeledPoint to use in my logistic regression model? I could do this in Scala using val trainData = labeleddf.map(row =>
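
Editor's note: not from the thread, but a hedged Java sketch of the same row mapping as the Scala fragment, assuming a Spark 1.x pipeline where the "features" column holds an org.apache.spark.mllib.linalg.Vector (on Spark 2.x ML the vector type differs, so the cast would change).

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

public class ToLabeledPointSketch {
  // labeleddf is assumed to be model.transform(newdf).select("label", "features")
  static JavaRDD<LabeledPoint> toLabeledPoints(DataFrame labeleddf) {
    return labeleddf.toJavaRDD().map(row ->
        new LabeledPoint(row.getDouble(0), (Vector) row.get(1)));
  }
}
```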

Re: spark.hadoop.dfs.replication parameter not working for kafka-spark streaming

2016-05-31 Thread Abhishek Anand
I also tried jsc.sparkContext().sc().hadoopConfiguration().set("dfs.replication", "2") But it's still not working. Any ideas why it's not working? Abhi On Tue, May 31, 2016 at 4:03 PM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > My spark streaming

spark.hadoop.dfs.replication parameter not working for kafka-spark streaming

2016-05-31 Thread Abhishek Anand
My Spark Streaming checkpoint directory is being written to HDFS with the default replication factor of 3. In my streaming application, where I am listening to Kafka and setting dfs.replication = 2 as below, the files are still being written with replication factor 3. SparkConf sparkConfig = new
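
Editor's note: for reference, the two standard routes for handing a Hadoop client setting such as dfs.replication to Spark are sketched below. This is a sketch only, not the author's code; whether the streaming checkpoint writer honours the setting can depend on the Spark version, which is what this thread is probing.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ReplicationSettingSketch {
  public static void main(String[] args) {
    // Route 1: any "spark.hadoop.*" key is copied into the Hadoop Configuration
    // that Spark builds, with the "spark.hadoop." prefix stripped.
    SparkConf sparkConfig = new SparkConf()
        .setAppName("replication-sketch")
        .set("spark.hadoop.dfs.replication", "2");

    JavaSparkContext jsc = new JavaSparkContext(sparkConfig);

    // Route 2: set it directly on the driver-side Hadoop Configuration
    // (this is the variant tried in the follow-up message above).
    jsc.hadoopConfiguration().set("dfs.replication", "2");

    // ... create the JavaStreamingContext and checkpoint directory here ...

    jsc.stop();
  }
}
```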

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Abhishek Anand
features (columns of type string) will be one-hot > encoded automatically. > So pre-processing like `as.factor` is not necessary, you can directly feed > your data to the model training. > > Thanks > Yanbo > > 2016-05-30 2:06 GMT-07:00 Abhishek Anand <abhis.anan...@gma

Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Abhishek Anand
Hi , I want to run glm variant of sparkR for my data that is there in a csv file. I see that the glm function in sparkR takes a spark dataframe as input. Now, when I read a file from csv and create a spark dataframe, how could I take care of the factor variables/columns in my data ? Do I need

RE: GraphX Java API

2016-05-29 Thread Kumar, Abhishek (US - Bengaluru)
java/org/apache/spark/graphx/util/package-frame.html> Aren’t they meant to be used with JAVA? Thanks From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com] Sent: Friday, May 27, 2016 4:52 PM To: Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>; user@spark.apache.org Subjec

GraphX Java API

2016-05-27 Thread Kumar, Abhishek (US - Bengaluru)
Hi, We are trying to consume the Java API for GraphX, but there is no documentation available online on the usage or examples. It would be great if we could get some examples in Java. Thanks and regards, Abhishek Kumar Products & Services | iLab Deloitte Consulting LLP Block ‘C’, Divya

Calculating log-loss for the trained model in Spark ML

2016-05-03 Thread Abhishek Anand
I am building a ML pipeline for logistic regression. val lr = new LogisticRegression() lr.setMaxIter(100).setRegParam(0.001) val pipeline = new Pipeline().setStages(Array(geoDimEncoder,clientTypeEncoder, devTypeDimIdEncoder,pubClientIdEncoder,tmpltIdEncoder,

Re: removing header from csv file

2016-05-03 Thread Abhishek Anand
You can use this function to remove the header from your dataset(applicable to RDD) def dropHeader(data: RDD[String]): RDD[String] = { data.mapPartitionsWithIndex((idx, lines) => { if (idx == 0) { lines.drop(1) } lines }) } Abhi On Wed, Apr 27, 2016 at

Clear Threshold in Logistic Regression ML Pipeline

2016-05-03 Thread Abhishek Anand
Hi All, I am trying to build a logistic regression pipeline in ML. How can I clear the threshold which by default is 0.5. In mllib I am able to clear the threshold to get the raw predictions using model.clearThreshold() function. Regards, Abhi

RE: removing header from csv file

2016-04-27 Thread Mishra, Abhishek
You should be doing something like this: data = sc.textFile('file:///path1/path/test1.csv') header = data.first() #extract header #print header data = data.filter(lambda x:x !=header) #print data Hope it helps. Sincerely, Abhishek +91-7259028700 From: nihed mbarek [mailto:nihe...@gmail.com

Fwd: Facing Unusual Behavior with the executors in spark streaming

2016-04-05 Thread Abhishek Anand
Hi, I need inputs on a couple of issues that I am facing in my production environment. I am using Spark Streaming on Spark 1.4.0. 1) It so happens that the worker is lost on a machine and the executor still shows up in the executors tab in the UI. Even when I kill a worker using kill -9

Re: Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-04-01 Thread Abhishek Anand
(SingleThreadEventExecutor.java:116) ... 1 more Cheers !! Abhi On Fri, Apr 1, 2016 at 9:04 AM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > This is what I am getting in the executor logs > > 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while > revert

Re: Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Abhishek Anand
> > On Thu, Mar 31, 2016 at 11:32 AM, Abhishek Anand <abhis.anan...@gmail.com> > wrote: > >> >> Hi, >> >> Why is it so that when my disk space is full on one of the workers then >> the executor on that worker becomes unresponsive and the jobs on

Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Abhishek Anand
Hi, Why is it that when the disk space is full on one of the workers, the executor on that worker becomes unresponsive and the jobs on that worker fail with the exception 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file

Output the data to external database at particular time in spark streaming

2016-03-08 Thread Abhishek Anand
I have a Spark Streaming job where I am aggregating the data by doing reduceByKeyAndWindow with an inverse function. I am keeping the data in memory for up to 2 hours, and in order to output the reduced data to external storage I conditionally need to push the data to the DB, say at every 15th minute of
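
Editor's note: one way to express "only write out at every 15th minute" is the two-argument foreachRDD, which hands each batch its Time. A hedged sketch follows; the key/value types are assumptions and writeToDb is a hypothetical sink helper.

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class ConditionalOutputSketch {
  // "windowed" is assumed to be the DStream produced by reduceByKeyAndWindow above.
  static void writeEveryFifteenMinutes(JavaPairDStream<String, Long> windowed) {
    final long fifteenMinutesMs = 15L * 60L * 1000L;

    windowed.foreachRDD((JavaPairRDD<String, Long> rdd, Time time) -> {
      // Batch times are multiples of the slide interval, so with a 1-minute slide
      // this fires only on batches whose time lands on a 15-minute boundary.
      if (time.milliseconds() % fifteenMinutesMs == 0) {
        writeToDb(rdd.collect());              // hypothetical sink call
      }
    });
  }

  private static void writeToDb(java.util.List<scala.Tuple2<String, Long>> rows) {
    // placeholder: push the reduced window to the external database here
  }
}
```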

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-29 Thread Abhishek Anand
<shixi...@databricks.com > wrote: > Sorry that I forgot to tell you that you should also call `rdd.count()` > for "reduceByKey" as well. Could you try it and see if it works? > > On Sat, Feb 27, 2016 at 1:17 PM, Abhishek Anand <abhis.anan...@gmail.com> >

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-29 Thread Abhishek Anand
tall the snappy native library in your new machines? > > On Fri, Feb 26, 2016 at 11:05 PM, Abhishek Anand <abhis.anan...@gmail.com> > wrote: > >> Any insights on this ? >> >> On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand <abhis.anan...@gmail.com> >> w

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-27 Thread Abhishek Anand
; } > }); > return stateDStream.stateSnapshots(); > > > On Mon, Feb 22, 2016 at 12:25 PM, Abhishek Anand <abhis.anan...@gmail.com> > wrote: > >> Hi Ryan, >> >> Reposting the code. >> >> Basically my use case is something like - I am re

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-26 Thread Abhishek Anand
Any insights on this ? On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > On changing the default compression codec which is snappy to lzf the > errors are gone !! > > How can I fix this using snappy as the codec ? > > Is there any downsid

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-25 Thread Abhishek Anand
On changing the default compression codec which is snappy to lzf the errors are gone !! How can I fix this using snappy as the codec ? Is there any downside of using lzf as snappy is the default codec that ships with spark. Thanks !!! Abhi On Mon, Feb 22, 2016 at 7:42 PM, Abhishek Anand

RE: LDA topic Modeling spark + python

2016-02-24 Thread Mishra, Abhishek
Hello All, If someone has any leads on this please help me. Sincerely, Abhishek From: Mishra, Abhishek Sent: Wednesday, February 24, 2016 5:11 PM To: user@spark.apache.org Subject: LDA topic Modeling spark + python Hello All, I am doing a LDA model, please guide me with something. I

LDA topic Modeling spark + python

2016-02-24 Thread Mishra, Abhishek
ed status. The topic length being 2000 and value of k or number of words being 3. Please, if you can provide me with some link or some code base on spark with python ; I would be grateful. Looking forward for a reply, Sincerely, Abhishek

value from groubBy paired rdd

2016-02-23 Thread Mishra, Abhishek
ouped=pairs.groupByKey()#grouping values as per key grouped_val= grouped.map(lambda x : (list(x[1]))).collect() print grouped_val Thanks in Advance, Sincerely, Abhishek

Query Kafka Partitions from Spark SQL

2016-02-23 Thread Abhishek Anand
Is there a way to query the json (or any other format) data stored in kafka using spark sql by providing the offset range on each of the brokers ? I just want to be able to query all the partitions in a sq manner. Thanks ! Abhi

RE: Sample project on Image Processing

2016-02-22 Thread Mishra, Abhishek
Thank you, everyone. I am to work on a PoC with 2 types of images; that will basically be two PoCs: face recognition and map data processing. I am looking into these links and hopefully will get an idea. Thanks again. Will post the queries as and when I get doubts. Sincerely, Abhishek From: ndj

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Abhishek Anand
ry 10 batches. > However, there is a known issue that prevents mapWithState from > checkpointing in some special cases: > https://issues.apache.org/jira/browse/SPARK-6847 > > On Mon, Feb 22, 2016 at 5:47 AM, Abhishek Anand <abhis.anan...@gmail.com> > wrote: > >> Any Insi

java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-22 Thread Abhishek Anand
Hi , I am getting the following exception on running my spark streaming job. The same job has been running fine since long and when I added two new machines to my cluster I see the job failing with the following exception. 16/02/22 19:23:01 ERROR Executor: Exception in task 2.0 in stage

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Abhishek Anand
Any Insights on this one ? Thanks !!! Abhi On Mon, Feb 15, 2016 at 11:08 PM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > I am now trying to use mapWithState in the following way using some > example codes. But, by looking at the DAG it does not seem to checkpoint > the

Sample project on Image Processing

2016-02-22 Thread Mishra, Abhishek
Hello, I am working on image processing samples. I was wondering if anyone has worked on an image processing project in Spark. Please let me know if any sample project or example is available. Please guide me on this. Sincerely, Abhishek

Re: Worker's BlockManager Folder not getting cleared

2016-02-17 Thread Abhishek Anand
Looking for answer to this. Is it safe to delete the older files using find . -type f -cmin +200 -name "shuffle*" -exec rm -rf {} \; For a window duration of 2 hours how older files can we delete ? Thanks. On Sun, Feb 14, 2016 at 12:34 PM, Abhishek Anand <abhis.anan...@gmail.com&

Re: Saving Kafka Offsets to Cassandra at begining of each batch in Spark Streaming

2016-02-16 Thread Abhishek Anand
raightforward to just use the normal cassandra client > to save them from the driver. > > On Tue, Feb 16, 2016 at 1:15 AM, Abhishek Anand <abhis.anan...@gmail.com> > wrote: > >> I have a kafka rdd and I need to save the offsets to cassandra table at >> the b

Re: Re: Unusually large deserialisation time

2016-02-16 Thread Abhishek Modi
PS - I don't get this behaviour in all cases. I did many runs of the same job & I get this behaviour in around 40% of the cases. Task 4 is the bottom row in the metrics table. Thank you, Abhishek e: abshkm...@gmail.com p: 91-8233540996 On Tue, Feb 16, 2016 at 11:19 PM, Abhishek

Re: Re: Unusually large deserialisation time

2016-02-16 Thread Abhishek Modi
Darren: this is not the last task of the stage. Thank you, Abhishek e: abshkm...@gmail.com p: 91-8233540996 On Tue, Feb 16, 2016 at 6:52 PM, Darren Govoni <dar...@ontrenet.com> wrote: > There were some posts in this group about it. Another person also saw the > deadlock on

Unusually large deserialisation time

2016-02-16 Thread Abhishek Modi
executors with 1 core for each executor. The cached rdd has 60 blocks. The problem is for every 2-3 runs of the job, there is a task which has an abnormally large deserialisation time. Screenshot attached Thank you,

Abnormally large deserialisation time for some tasks

2016-02-16 Thread Abhishek Modi
I'm doing a mapPartitions on a rdd cached in memory followed by a reduce. Here is my code snippet // myRdd is an rdd consisting of Tuple2[Int,Long] myRdd.mapPartitions(rangify).reduce( (x,y) => (x._1+y._1,x._2 ++ y._2)) //The rangify function def rangify(l: Iterator[ Tuple2[Int,Long] ]) :

Saving Kafka Offsets to Cassandra at begining of each batch in Spark Streaming

2016-02-15 Thread Abhishek Anand
I have a Kafka RDD and I need to save the offsets to a Cassandra table at the beginning of each batch. Basically I need to write the offsets of the type Offsets below, which I am getting inside foreachRDD, to Cassandra. The javaFunctions API to write to Cassandra needs an RDD. How can I create an RDD
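
Editor's note: a sketch of the two options touched on in the reply above: read the per-partition offsets through HasOffsetRanges at the start of the batch, then either save them from the driver with a plain Cassandra client or wrap the small array in an RDD so an RDD-based writer can consume it. Package names assume the kafka-0-10 direct stream; the key/value types are assumptions.

```java
import java.util.Arrays;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class OffsetsAsRddSketch {
  // Call inside foreachRDD, on the RDD that comes straight from createDirectStream
  // (the HasOffsetRanges cast only works before any transformation is applied).
  static JavaRDD<OffsetRange> offsetsAsRdd(JavaSparkContext jsc,
                                           JavaRDD<ConsumerRecord<String, String>> batch) {
    OffsetRange[] ranges = ((HasOffsetRanges) batch.rdd()).offsetRanges();

    // The offsets array is tiny, so parallelizing it is cheap; alternatively,
    // just write "ranges" from the driver with a regular Cassandra client.
    return jsc.parallelize(Arrays.asList(ranges));
  }
}
```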

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-15 Thread Abhishek Anand
p.m., "Ted Yu" <yuzhih...@gmail.com> wrote: > >> mapWithState supports checkpoint. >> >> There has been some bug fix since release of 1.6.0 >> e.g. >> SPARK-12591 NullPointerException using checkpointed mapWithState with >> KryoSerializer

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-13 Thread Abhishek Anand
there. Is there any other work around ? Cheers!! Abhi On Fri, Feb 12, 2016 at 3:33 AM, Sebastian Piu <sebastian@gmail.com> wrote: > Looks like mapWithState could help you? > On 11 Feb 2016 8:40 p.m., "Abhishek Anand" <abhis.anan...@gmail.com> > wrote: > >&g

Re: Worker's BlockManager Folder not getting cleared

2016-02-13 Thread Abhishek Anand
Hi All, Any ideas on this one ? The size of this directory keeps on growing. I can see there are many files from a day earlier too. Cheers !! Abhi On Tue, Jan 26, 2016 at 7:13 PM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > Hi Adrian, > > I am running spark in

Stateful Operation on JavaPairDStream Help Needed !!

2016-02-11 Thread Abhishek Anand
Hi All, I have a use case as follows in my production environment, where I am listening to Kafka with a slideInterval of 1 min and a windowLength of 2 hours. I have a JavaPairDStream where for each key I am getting the same key but with different values, which might appear in the same batch or
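
Editor's note: since the replies in this thread converge on mapWithState, here is a hedged, self-contained sketch of the Java form. It is written against the Spark 2.x Java API (on 1.6 the Optional type is Guava's and orElse becomes or); the key/value types and the timeout value are assumptions.

```java
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;

import scala.Tuple2;

public class MapWithStateSketch {
  // Keeps a running Long aggregate per key across batches.
  static JavaMapWithStateDStream<String, Long, Long, Tuple2<String, Long>> withState(
      JavaPairDStream<String, Long> pairs) {

    Function3<String, Optional<Long>, State<Long>, Tuple2<String, Long>> mappingFunc =
        (key, value, state) -> {
          if (state.isTimingOut()) {
            // Key has been idle past the timeout: emit what we had; no update allowed here.
            return new Tuple2<>(key, state.exists() ? state.get() : 0L);
          }
          long sum = value.orElse(0L) + (state.exists() ? state.get() : 0L);
          state.update(sum);                        // carry the aggregate to the next batch
          return new Tuple2<>(key, sum);
        };

    return pairs.mapWithState(
        StateSpec.function(mappingFunc)
                 .timeout(Durations.minutes(120))); // drop idle keys after the 2-hour window
  }
}
```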

Re: Repartition taking place for all previous windows even after checkpointing

2016-02-01 Thread Abhishek Anand
Any insights on this ? On Fri, Jan 29, 2016 at 1:08 PM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > Hi All, > > Can someone help me with the following doubts regarding checkpointing : > > My code flow is something like follows -> > > 1) create direct stre

Repartition taking place for all previous windows even after checkpointing

2016-01-28 Thread Abhishek Anand
Hi All, Can someone help me with the following doubts regarding checkpointing : My code flow is something like follows -> 1) create direct stream from kafka 2) repartition kafka stream 3) mapToPair followed by reduceByKey 4) filter 5) reduceByKeyAndWindow without the inverse function 6)

Re: Worker's BlockManager Folder not getting cleared

2016-01-26 Thread Abhishek Anand
be hitting > https://issues.apache.org/jira/browse/SPARK-10975 > With spark >= 1.6: > https://issues.apache.org/jira/browse/SPARK-12430 > and also be aware of: > https://issues.apache.org/jira/browse/SPARK-12583 > > > On 25/01/2016 07:14, Abhishek Anand wrote: > > Hi

Worker's BlockManager Folder not getting cleared

2016-01-24 Thread Abhishek Anand
Hi All, How long the shuffle files and data files are stored on the block manager folder of the workers. I have a spark streaming job with window duration of 2 hours and slide interval of 15 minutes. When I execute the following command in my block manager path find . -type f -cmin +150 -name

Getting kafka offsets at beginning of spark streaming application

2016-01-11 Thread Abhishek Anand
Hi, Is there a way I can fetch the offsets from where Spark Streaming starts reading from Kafka when my application starts? What I am trying to do is create an initial RDD with offsets at a particular time passed as input from the command line, and the offsets from where my spark

Re: [Spark-SQL] Custom aggregate function for GrouppedData

2016-01-07 Thread Abhishek Gayakwad
docs/latest/api/scala/index.html#org.apache.spark.sql.GroupedDataset> > has > mapGroups, which sounds like what you are looking for. You can also write > a custom Aggregator > <https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html> > >

[Spark-SQL] Custom aggregate function for GrouppedData

2016-01-05 Thread Abhishek Gayakwad
me(values, SalesColumns.getOutputSchema()); } public String csvFormat(Collection collection) { return collection.stream().map(Object::toString).collect(Collectors.joining(",")); } } Please suggest if there is a better way of doing this. Regards, Abhishek

Error on using updateStateByKey

2015-12-18 Thread Abhishek Anand
I am trying to use updateStateByKey but am receiving the following error (Spark version 1.4.0). Can someone please point out what might be the possible reason for this error? *The method updateStateByKey(Function2) in the type JavaPairDStream is
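
Editor's note: the error text is cut off, but this compile error is often caused by the update function not matching the expected Function2 signature, for example by using a different Optional class or mismatched type parameters. A hedged sketch of the shape the Spark 1.4 Java API expects (types here are assumptions; on 1.x the Optional is Guava's):

```java
import java.util.List;

import com.google.common.base.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class UpdateStateByKeySketch {
  // updateStateByKey expects Function2<List<V>, Optional<S>, Optional<S>>.
  static JavaPairDStream<String, Integer> runningCounts(JavaPairDStream<String, Integer> pairs) {
    Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunc =
        (values, state) -> {
          int sum = state.or(0);          // previous state, or 0 for a new key
          for (Integer v : values) {
            sum += v;                     // fold in this batch's values for the key
          }
          return Optional.of(sum);
        };
    return pairs.updateStateByKey(updateFunc);
  }
}
```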

Re: PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-16 Thread Abhishek Shivkumar
for ele in line[1]: 4. Write every ele into the file created. 5. Close the file. Do you think this works? Thanks Abhishek S Thank you! With Regards, Abhishek S On Wed, Dec 16, 2015 at 1:05 AM, Daniel Valdivia <h...@danielvaldivia.com> wrote: > Hello everyone, &

How to unpack the values of an item in a RDD so I can create a RDD with multiple items?

2015-12-13 Thread Abhishek Shivkumar
in line[1]]) but it throws an error saying "AttributeError: 'PipelinedRDD' object has no attribute 'flatmap" Can someone tell me the right way to unpack the values to different items in the new RDD? Thank you! With Regards, Abhishek S

Is it possible to pass additional parameters to a python function when used inside RDD.filter method?

2015-12-04 Thread Abhishek Shivkumar
separate parameter to my_func, besides the item that goes into it. How can I do that? I know my_item will refer to one item that comes from my_rdd and how can I pass my own parameter (let's say my_param) as an additional parameter to my_func? Thanks Abhishek S -- *NOTICE AND DISCLAIMER

Re: Is it possible to pass additional parameters to a python function when used inside RDD.filter method?

2015-12-04 Thread Abhishek Shivkumar
Excellent. that did work - thanks. On 4 December 2015 at 12:35, Praveen Chundi <mail.chu...@gmail.com> wrote: > Passing a lambda function should work. > > my_rrd.filter(lambda x: myfunc(x,newparam)) > > Best regards, > Praveen Chundi > > > On 04.12.2015 13:19,
