Re: automatic start of streaming job on failure on YARN

2015-10-03 Thread Jeetendra Gangele
Yes, in yarn-cluster mode.

On 2 October 2015 at 22:10, Ashish Rangole <arang...@gmail.com> wrote:

> Are you running the job in yarn cluster mode?
> On Oct 1, 2015 6:30 AM, "Jeetendra Gangele" <gangele...@gmail.com> wrote:
>
>> We have a streaming application running on YARN and we would like to ensure
>> that it is up and running 24/7.
>>
>> Is there a way to tell yarn to automatically restart a specific
>> application on failure?
>>
>> There is a property, yarn.resourcemanager.am.max-attempts, which defaults
>> to 2. Is setting it to a bigger value the solution? I also observed that
>> this does not seem to work: my application is failing and is not being
>> restarted automatically.
>>
>> Mesos has this built-in support; I am wondering why YARN is lacking here.
>>
>>
>>
>> Regards
>>
>> jeetendra
>>
>


automatic start of streaming job on failure on YARN

2015-10-01 Thread Jeetendra Gangele
We have a streaming application running on YARN and we would like to ensure
that it is up and running 24/7.

Is there a way to tell yarn to automatically restart a specific application
on failure?

There is a property, yarn.resourcemanager.am.max-attempts, which defaults to 2.
Is setting it to a bigger value the solution? I also observed that this does
not seem to work: my application is failing and is not being restarted
automatically.

Mesos has this built-in support; I am wondering why YARN is lacking here.
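
One knob that is easy to miss here: Spark caps its own ApplicationMaster
attempts separately from YARN, so both limits normally have to be raised. A
minimal sketch, assuming a hypothetical application class and jar that are not
taken from this thread:

# spark.yarn.maxAppAttempts caps Spark's own retries; it should not exceed the
# global yarn.resourcemanager.am.max-attempts configured on the ResourceManager.
spark-submit \
  --master yarn-cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --class com.example.StreamingApp \
  streaming-app.jar

Even with retries, the restart only happens while the same YARN application is
alive; a supervisor outside YARN is still needed if the application itself
fails its final attempt or is killed.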



Regards

jeetendra


Re: Deploying spark-streaming application on production

2015-10-01 Thread Jeetendra Gangele
Yes. I also think I need to enable checkpointing and, rather than rebuilding
the lineage DAG, store the RDD data in HDFS.
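
A minimal sketch of what that driver-side checkpointing could look like,
assuming a hypothetical checkpoint directory and application name (not code
from this thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/streaming/checkpoints"   // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("mqtt-to-hdfs")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)   // enables metadata (and data) checkpointing
  // ... build the MQTT DStream pipeline here ...
  ssc
}

// On restart the driver recovers the DAG and pending batches from the
// checkpoint instead of rebuilding the lineage from scratch.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()

Note that checkpoints are tied to the compiled application code, so they
generally do not survive an upgrade to a new jar; they help with failure
recovery rather than with redeployment.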

On 23 September 2015 at 01:04, Adrian Tanase <atan...@adobe.com> wrote:

> btw I re-read the docs and I want to clarify that reliable receiver + WAL
> gives you at least once, not exactly once semantics.
>
> Sent from my iPhone
>
> On 21 Sep 2015, at 21:50, Adrian Tanase <atan...@adobe.com> wrote:
>
> I'm wondering, isn't this the canonical use case for WAL + reliable
> receiver?
>
> As far as I know you can tune the MQTT server to wait for an ack on messages
> (QoS level 2?).
> With some support from the client library you could achieve exactly-once
> semantics on the read side, if you ack a message only after writing it to the
> WAL, correct?
>
> -adrian
>
> Sent from my iPhone
>
> On 21 Sep 2015, at 12:35, Petr Novak <oss.mli...@gmail.com> wrote:
>
> In short there is no direct support for it in Spark AFAIK. You will either
> manage it in MQTT or have to add another layer of indirection - either
> in-memory based (observable streams, in-mem db) or disk based (Kafka, HDFS
> files, db) - which will keep your unprocessed events.
>
> Now realizing, there is support for backpressure in v1.5.0 but I don't
> know if it could be exploited aka I don't know if it is possible to
> decouple event reading into memory and actual processing code in Spark
> which could be swapped on the fly. Probably not without some custom built
> facility for it.
>
> Petr
>
> On Mon, Sep 21, 2015 at 11:26 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>
>> I should read my posts at least once to avoid so many typos. Hopefully
>> you are brave enough to read through.
>>
>> Petr
>>
>> On Mon, Sep 21, 2015 at 11:23 AM, Petr Novak <oss.mli...@gmail.com>
>> wrote:
>>
>>> I think you would have to persist events somehow if you don't want to
>>> miss them. I don't see any other option there. Either in MQTT if it is
>>> supported there, or by routing them through Kafka.
>>>
>>> There is the WriteAheadLog in Spark, but you would have to decouple MQTT
>>> stream reading and processing into 2 separate jobs so that you could
>>> upgrade the processing one, assuming the reading one would be stable
>>> (without changes) across versions. But it is problematic because there is
>>> no easy way to share DStreams between jobs - you would have to develop your
>>> own facility for it.
>>>
>>> Alternatively the reading job could save MQTT events in their most raw
>>> form into files - to limit the need to change code - and then the
>>> processing job would work on top of it using file-based Spark Streaming.
>>> I think this is inefficient and can get quite complex if you would like to
>>> make it reliable.
>>>
>>> Basically either MQTT supports persistence (which I don't know) or there
>>> is Kafka for these use cases.
>>>
>>> Another option, I think, would be to place observable streams in between
>>> MQTT and Spark Streaming with backpressure, so that you could perform the
>>> upgrade until the buffers fill up.
>>>
>>> I'm sorry that it is not thought out well from my side, it is just a
>>> brainstorm but it might lead you somewhere.
>>>
>>> Regards,
>>> Petr
>>>
>>> On Mon, Sep 21, 2015 at 10:09 AM, Jeetendra Gangele <
>>> gangele...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have an spark streaming application with batch (10 ms) which is
>>>> reading the MQTT channel and dumping the data from MQTT to HDFS.
>>>>
>>>> So suppose if I have to deploy new application jar(with changes in
>>>> spark streaming application) what is the best way to deploy, currently I am
>>>> doing as below
>>>>
>>>> 1.killing the running streaming app using yarn application -kill ID
>>>> 2. and then starting the application again
>>>>
>>>> Problem with above approach is since we are not persisting the events
>>>> in MQTT we will miss the events for the period of deploy.
>>>>
>>>> how to handle this case?
>>>>
>>>> regards
>>>> jeeetndra
>>>>
>>>


Deploying spark-streaming application on production

2015-09-21 Thread Jeetendra Gangele
Hi All,

I have a Spark Streaming application with a 10 ms batch interval which reads
from an MQTT channel and dumps the data from MQTT to HDFS.

Suppose I have to deploy a new application jar (with changes to the Spark
Streaming application). What is the best way to deploy? Currently I do the
following:

1. kill the running streaming app using yarn application -kill ID
2. start the application again

The problem with the above approach is that, since we are not persisting the
events in MQTT, we will miss the events for the duration of the deploy.

How should this case be handled?
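
One thing that softens step 1, offered as a sketch only (not from this thread):
stop the current job gracefully so batches already received are processed
before the kill. It narrows, but does not remove, the loss window; avoiding
loss entirely still needs a durable source such as Kafka, as discussed in the
replies.

import org.apache.spark.streaming.StreamingContext

def shutdownGracefully(ssc: StreamingContext): Unit = {
  // finish processing of data already received, then stop the streaming
  // context together with the underlying SparkContext
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}

// Recent 1.x releases also accept spark.streaming.stopGracefullyOnShutdown=true
// so a SIGTERM (e.g. from yarn application -kill) can trigger the same behaviour.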

regards
Jeetendra


bad substitution for [hdp.version] Error in spark on YARN job

2015-09-09 Thread Jeetendra Gangele
Hi,
I am getting the error below when running a Spark job on YARN with an HDP
cluster.
I have installed Spark and YARN from Ambari and I am using Spark 1.3.1 with
HDP version HDP-2.3.0.0-2557.

My spark-defaults.conf has the correct entries:

spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2557
spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2557

Can anybody from HDP reply on this? I am not sure why hdp.version is not
getting passed, though it is set up correctly in the conf file. I tried passing
the same to spark-submit with --conf "hdp.version=2.3.0.0-2557"; same issue, no
luck.

I am running my job with spark-submit from the spark-client machine.



Exit code: 1
Exception message:
/hadoop/yarn/local/usercache/hdfs/appcache/application_1441798371988_0002/container_e08_1441798371988_0002_01_05/launch_container.sh:
line 22:
$PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:
bad substitution

Stack trace: ExitCodeException exitCode=1:
/hadoop/yarn/local/usercache/hdfs/appcache/application_1441798371988_0002/container_e08_1441798371988_0002_01_05/launch_container.sh:
line 22:
$PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:
bad substitution

at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1


Re: bad substitution for [hdp.version] Error in spark on YARN job

2015-09-09 Thread Jeetendra Gangele
Finally it worked out. I solved it by modifying mapred-site.xml: from the YARN
application master property I removed the HDP version entries.
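
For reference, a sketch of the usual way these values are passed (using the
version string from this thread); the earlier attempt passed a bare
"hdp.version", which spark-submit treats as a non-Spark property and does not
turn into a -D system property:

# spark-defaults.conf
spark.driver.extraJavaOptions  -Dhdp.version=2.3.0.0-2557
spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2557

# or per submission (class and jar below are hypothetical):
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dhdp.version=2.3.0.0-2557" \
  --conf "spark.yarn.am.extraJavaOptions=-Dhdp.version=2.3.0.0-2557" \
  --class com.example.MyJob --master yarn-cluster my-job.jar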



On 9 September 2015 at 17:44, Jeetendra Gangele <gangele...@gmail.com>
wrote:

> Hi ,
> I am getting below error when running the spark job on YARN with HDP
> cluster.
> I have installed spark and yarn from Ambari and I am using spark 1.3.1
> with HDP version HDP-2.3.0.0-2557.
>
> My spark-default.conf has correct entry
>
> spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2557
> spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2557
>
> can anybody from HDP reply on this not sure why hp.version is not getting
> passed thought it setup in con file correctly. I tried passing same to
> spark-submit with --conf "hdp.version=2.3.0.0-2557" same issue no lock.
>
> I am running my job with spark-submit from spark-client machine
>
>
>
> Exit code: 1
> Exception message:
> /hadoop/yarn/local/usercache/hdfs/appcache/application_1441798371988_0002/container_e08_1441798371988_0002_01_05/launch_container.sh:
> line 22:
> $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:
> bad substitution
>
> Stack trace: ExitCodeException exitCode=1:
> /hadoop/yarn/local/usercache/hdfs/appcache/application_1441798371988_0002/container_e08_1441798371988_0002_01_05/launch_container.sh:
> line 22:
> $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:
> bad substitution
>
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
> at org.apache.hadoop.util.Shell.run(Shell.java:456)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Container exited with a non-zero exit code 1
>
>


Re: Sending yarn application logs to web socket

2015-09-08 Thread Jeetendra Gangele
1. In order to change log4j.properties at the name node, you can change
/home/hadoop/log4j.properties.

2. In order to change log4j.properties for the container logs, you need to
change it in the YARN containers jar, since it is hard-coded to load the
file directly from the project resources.

2.1 ssh to the slave (on EMR you can also simply add this as a bootstrap
action, so you don't need to ssh to each of the nodes).

2.2 Override container-log4j.properties in the jar resources:

jar uf
/home/hadoop/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.2.0.jar
container-log4j.properties
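
The full round trip, as a sketch (jar path as above; the edit step is whatever
appender change is needed):

cd /tmp
# pull the current file out of the jar, edit it, then push it back in
jar xf /home/hadoop/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.2.0.jar container-log4j.properties
vi container-log4j.properties
jar uf /home/hadoop/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.2.0.jar container-log4j.properties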

On 8 September 2015 at 05:47, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:

> Hopefully someone will give you a more direct answer but whenever I'm
> having issues with log4j I always try -Dlog4j.debug=true. This will tell
> you which log4j settings are getting picked up from where. I've spent
> countless hours due to typos in the file, for example.
>
> On Mon, Sep 7, 2015 at 11:47 AM, Jeetendra Gangele <gangele...@gmail.com>
> wrote:
>
>> I also tried placing my costomized log4j.properties file under
>> src/main/resources still no luck.
>>
>> won't above step modify the default YARN and spark  log4j.properties  ?
>>
>> anyhow its still taking log4j.properties from YARn.
>>
>>
>>
>> On 7 September 2015 at 19:25, Jeetendra Gangele <gangele...@gmail.com>
>> wrote:
>>
>>> anybody here to help?
>>>
>>>
>>>
>>> On 7 September 2015 at 17:53, Jeetendra Gangele <gangele...@gmail.com>
>>> wrote:
>>>
>>>> Hi All I have been trying to send my application related logs to socket
>>>> so that we can write log stash and check the application logs.
>>>>
>>>> here is my log4j.property file
>>>>
>>>> main.logger=RFA,SA
>>>>
>>>> log4j.appender.SA=org.apache.log4j.net.SocketAppender
>>>> log4j.appender.SA.Port=4560
>>>> log4j.appender.SA.RemoteHost=hadoop07.housing.com
>>>> log4j.appender.SA.ReconnectionDelay=1
>>>> log4j.appender.SA.Application=NM-${user.dir}
>>>> # Ignore messages below warning level from Jetty, because it's a bit
>>>> verbose
>>>> log4j.logger.org.spark-project.jetty=WARN
>>>> log4j.logger.org.apache.hadoop=WARN
>>>>
>>>>
>>>> I am launching my spark job using below common on YARN-cluster mode
>>>>
>>>> *spark-submit --name data-ingestion --master yarn-cluster --conf
>>>> spark.custom.configuration.file=hdfs://10.1.6.186/configuration/binning-dev.conf
>>>> <http://10.1.6.186/configuration/binning-dev.conf> --files
>>>> /usr/hdp/current/spark-client/Runnable/conf/log4j.properties --conf
>>>> "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
>>>> --conf
>>>> "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
>>>> --class com.housing.spark.streaming.Binning
>>>> /usr/hdp/current/spark-client/Runnable/dsl-data-ingestion-all.jar*
>>>>
>>>>
>>>> *Can anybody please guide me why i am not getting the logs the socket?*
>>>>
>>>>
>>>> *I followed many pages listing below without success*
>>>>
>>>> http://tech-stories.com/2015/02/12/setting-up-a-central-logging-infrastructure-for-hadoop-and-spark/#comment-208
>>>>
>>>> http://stackoverflow.com/questions/22918720/custom-log4j-appender-in-hadoop-2
>>>>
>>>> http://stackoverflow.com/questions/9081625/override-log4j-properties-in-hadoop
>>>>
>>>>
>>>
>>
>>
>>
>>
>


Sending yarn application logs to web socket

2015-09-07 Thread Jeetendra Gangele
Hi All, I have been trying to send my application-related logs to a socket so
that we can feed Logstash and check the application logs.

here is my log4j.property file

main.logger=RFA,SA

log4j.appender.SA=org.apache.log4j.net.SocketAppender
log4j.appender.SA.Port=4560
log4j.appender.SA.RemoteHost=hadoop07.housing.com
log4j.appender.SA.ReconnectionDelay=1
log4j.appender.SA.Application=NM-${user.dir}
# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.apache.hadoop=WARN


I am launching my Spark job using the command below in YARN-cluster mode:

*spark-submit --name data-ingestion --master yarn-cluster --conf
spark.custom.configuration.file=hdfs://10.1.6.186/configuration/binning-dev.conf
 --files
/usr/hdp/current/spark-client/Runnable/conf/log4j.properties --conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
--conf
"spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
--class com.housing.spark.streaming.Binning
/usr/hdp/current/spark-client/Runnable/dsl-data-ingestion-all.jar*


*Can anybody please guide me on why I am not getting the logs on the socket?*


*I followed many pages listing below without success*
http://tech-stories.com/2015/02/12/setting-up-a-central-logging-infrastructure-for-hadoop-and-spark/#comment-208
http://stackoverflow.com/questions/22918720/custom-log4j-appender-in-hadoop-2
http://stackoverflow.com/questions/9081625/override-log4j-properties-in-hadoop
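
One likely culprit, offered as an assumption rather than a confirmed diagnosis:
the file above sets main.logger=RFA,SA but never attaches SA to any logger, and
unless something substitutes ${main.logger} into a log4j.rootLogger line the
SocketAppender is never activated. A minimal log4j 1.x sketch that attaches it
explicitly (the reconnection delay is in milliseconds, so the value 1 used
above is almost certainly too low):

log4j.rootLogger=INFO, SA

log4j.appender.SA=org.apache.log4j.net.SocketAppender
log4j.appender.SA.RemoteHost=hadoop07.housing.com
log4j.appender.SA.Port=4560
log4j.appender.SA.ReconnectionDelay=10000
log4j.appender.SA.LocationInfo=true

log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.apache.hadoop=WARN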


Re: Sending yarn application logs to web socket

2015-09-07 Thread Jeetendra Gangele
I also tried placing my customized log4j.properties file under
src/main/resources; still no luck.

Won't the above step modify the default YARN and Spark log4j.properties?

Anyhow, it is still taking log4j.properties from YARN.



On 7 September 2015 at 19:25, Jeetendra Gangele <gangele...@gmail.com>
wrote:

> anybody here to help?
>
>
>
> On 7 September 2015 at 17:53, Jeetendra Gangele <gangele...@gmail.com>
> wrote:
>
>> Hi All I have been trying to send my application related logs to socket
>> so that we can write log stash and check the application logs.
>>
>> here is my log4j.property file
>>
>> main.logger=RFA,SA
>>
>> log4j.appender.SA=org.apache.log4j.net.SocketAppender
>> log4j.appender.SA.Port=4560
>> log4j.appender.SA.RemoteHost=hadoop07.housing.com
>> log4j.appender.SA.ReconnectionDelay=1
>> log4j.appender.SA.Application=NM-${user.dir}
>> # Ignore messages below warning level from Jetty, because it's a bit
>> verbose
>> log4j.logger.org.spark-project.jetty=WARN
>> log4j.logger.org.apache.hadoop=WARN
>>
>>
>> I am launching my spark job using below common on YARN-cluster mode
>>
>> *spark-submit --name data-ingestion --master yarn-cluster --conf
>> spark.custom.configuration.file=hdfs://10.1.6.186/configuration/binning-dev.conf
>> <http://10.1.6.186/configuration/binning-dev.conf> --files
>> /usr/hdp/current/spark-client/Runnable/conf/log4j.properties --conf
>> "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
>> --conf
>> "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
>> --class com.housing.spark.streaming.Binning
>> /usr/hdp/current/spark-client/Runnable/dsl-data-ingestion-all.jar*
>>
>>
>> *Can anybody please guide me why i am not getting the logs the socket?*
>>
>>
>> *I followed many pages listing below without success*
>>
>> http://tech-stories.com/2015/02/12/setting-up-a-central-logging-infrastructure-for-hadoop-and-spark/#comment-208
>>
>> http://stackoverflow.com/questions/22918720/custom-log4j-appender-in-hadoop-2
>>
>> http://stackoverflow.com/questions/9081625/override-log4j-properties-in-hadoop
>>
>>
>


Re: Sending yarn application logs to web socket

2015-09-07 Thread Jeetendra Gangele
anybody here to help?



On 7 September 2015 at 17:53, Jeetendra Gangele <gangele...@gmail.com>
wrote:

> Hi All I have been trying to send my application related logs to socket so
> that we can write log stash and check the application logs.
>
> here is my log4j.property file
>
> main.logger=RFA,SA
>
> log4j.appender.SA=org.apache.log4j.net.SocketAppender
> log4j.appender.SA.Port=4560
> log4j.appender.SA.RemoteHost=hadoop07.housing.com
> log4j.appender.SA.ReconnectionDelay=1
> log4j.appender.SA.Application=NM-${user.dir}
> # Ignore messages below warning level from Jetty, because it's a bit
> verbose
> log4j.logger.org.spark-project.jetty=WARN
> log4j.logger.org.apache.hadoop=WARN
>
>
> I am launching my spark job using below common on YARN-cluster mode
>
> *spark-submit --name data-ingestion --master yarn-cluster --conf
> spark.custom.configuration.file=hdfs://10.1.6.186/configuration/binning-dev.conf
> <http://10.1.6.186/configuration/binning-dev.conf> --files
> /usr/hdp/current/spark-client/Runnable/conf/log4j.properties --conf
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
> --conf
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
> --class com.housing.spark.streaming.Binning
> /usr/hdp/current/spark-client/Runnable/dsl-data-ingestion-all.jar*
>
>
> *Can anybody please guide me why i am not getting the logs the socket?*
>
>
> *I followed many pages listing below without success*
>
> http://tech-stories.com/2015/02/12/setting-up-a-central-logging-infrastructure-for-hadoop-and-spark/#comment-208
>
> http://stackoverflow.com/questions/22918720/custom-log4j-appender-in-hadoop-2
>
> http://stackoverflow.com/questions/9081625/override-log4j-properties-in-hadoop
>
>


Re: Loading already existing tables in spark shell

2015-08-25 Thread Jeetendra Gangele
In the spark shell, "use database" is not working; it says "use" is not found
in the shell.
Did you run this in the Scala shell?
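
For what it is worth, in the Scala shell "use" is not a shell command; it is
SQL passed through the HiveContext. A minimal sketch, assuming a hypothetical
database name "mydb":

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("use mydb")
sqlContext.sql("select count(*) from event_impressions").show()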

On 24 August 2015 at 18:26, Ishwardeep Singh ishwardeep.si...@impetus.co.in
 wrote:

 Hi Jeetendra,


 I faced this issue. I did not specify the database where this table
 exists. Please set the database by using use database command before
 executing the query.


 Regards,

 Ishwardeep

 --
 *From:* Jeetendra Gangele gangele...@gmail.com
 *Sent:* Monday, August 24, 2015 5:47 PM
 *To:* user
 *Subject:* Loading already existing tables in spark shell

 Hi All, I have a few tables in Hive and I wanted to run queries against them
 with Spark as the execution engine.

 Can I directly load these tables in the spark shell and run queries?

 I tried with
 1. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 2. sqlContext.sql("FROM event_impressions select count(*)"), where
 event_impressions is the table name.

 It gives me an error saying org.apache.spark.sql.AnalysisException: no such
 table event_impressions; line 1 pos 5

 Has anybody hit similar issues?


 regards
 jeetendra




Re: spark not launching in yarn-cluster mode

2015-08-25 Thread Jeetendra Gangele
When I launch with yarn-client it also gives me the error below:
bin/spark-sql --master yarn-client
15/08/25 13:53:20 ERROR YarnClientSchedulerBackend: Yarn application has
already exited with state FINISHED!
Exception in thread Yarn application state monitor
org.apache.spark.SparkException: Error asking standalone scheduler to shut
down executors
at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
at
org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)

On 25 August 2015 at 14:26, Yanbo Liang yblia...@gmail.com wrote:

 spark-shell and spark-sql cannot be deployed in yarn-cluster mode,
 because you need the spark-shell or spark-sql scripts to run on your local
 machine rather than in a container of the YARN cluster.

 2015-08-25 16:19 GMT+08:00 Jeetendra Gangele gangele...@gmail.com:

 Hi All, I am trying to launch the spark shell with --master yarn-cluster and
 it gives the error below.
 Why is this not supported?


 bin/spark-sql --master yarn-cluster
 Error: Cluster deploy mode is not applicable to Spark SQL shell.
 Run with --help for usage help or --verbose for debug output


 Regards
 Jeetendra
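
To summarise the constraint as a sketch (the application class and jar in the
last line are hypothetical): interactive shells run their driver, i.e. the
REPL, on the local machine, so only client mode applies; yarn-cluster is for
applications submitted non-interactively.

bin/spark-sql   --master yarn-client
bin/spark-shell --master yarn-client

bin/spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar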




creating data warehouse with Spark and running query with Hive

2015-08-19 Thread Jeetendra Gangele
Hi All,

I have data in HDFS partitioned by year/month/date/event_type, and I am
creating Hive tables from this data. The data is in JSON, so I am using the
JSON serde to create the Hive tables.
Below is the code:

val jsonFile =
  hiveContext.read.json("hdfs://localhost:9000/housing/events_real/category=Impressions/date=1007465766/*")
jsonFile.toDF().printSchema()
jsonFile.write.saveAsTable("JsonFileTable")
jsonFile.toDF().printSchema()
val events = hiveContext.sql("SELECT category, uid FROM JsonFileTable")
events.map(e => "Event: " + e).collect().foreach(println)

saveAsTable is failing with an error saying MkDir failed to create the
directory. Does anybody have any idea?


Re: Data from PostgreSQL to Spark

2015-08-03 Thread Jeetendra Gangele
Here is the solution; this looks perfect for me.
Thanks for all your help:

http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
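
If the bottled water + Kafka route is taken, the Spark side would then read the
change topic instead of querying PostgreSQL directly. A minimal sketch using
the direct Kafka stream (requires the spark-streaming-kafka artifact; broker
address and topic name are hypothetical):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PostgresChangeConsumer {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("pg-cdc"), Seconds(1))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val changes = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("pg_changes"))
    changes.map(_._2).print()   // each record value is one change event payload
    ssc.start()
    ssc.awaitTermination()
  }
}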

On 28 July 2015 at 23:27, Jörn Franke jornfra...@gmail.com wrote:

 Can you put some transparent cache in front of the database? Or some jdbc
 proxy?

 On Tue, 28 Jul 2015 at 19:34, Jeetendra Gangele gangele...@gmail.com
 wrote:

 can the source write to Kafka/Flume/Hbase in addition to Postgres? no
 it can't write ,this is due to the fact that there are many applications
 those are producing this postGreSql data.I can't really asked all the teams
 to start writing to some other source.


 velocity of the application is too high.






 On 28 July 2015 at 21:50, santosh...@gmail.com wrote:

 Sqoop’s incremental data fetch will reduce the data size you need to
 pull from source, but then by the time that incremental data fetch is
 complete, is it not current again, if velocity of the data is high?

 May be you can put a trigger in Postgres to send data to the big data
 cluster as soon as changes are made. Or as I was saying in another email,
 can the source write to Kafka/Flume/Hbase in addition to Postgres?

 Sent from Windows Mail

 *From:* Jeetendra Gangele gangele...@gmail.com
 *Sent:* ‎Tuesday‎, ‎July‎ ‎28‎, ‎2015 ‎5‎:‎43‎ ‎AM
 *To:* santosh...@gmail.com
 *Cc:* ayan guha guha.a...@gmail.com, felixcheun...@hotmail.com,
 user@spark.apache.org

 I trying do that, but there will always data mismatch, since by the time
 scoop is fetching main database will get so many updates. There is
 something called incremental data fetch using scoop but that hits a
 database rather than reading the WAL edit.



 On 28 July 2015 at 02:52, santosh...@gmail.com wrote:

 Why cant you bulk pre-fetch the data to HDFS (like using Sqoop) instead
 of hitting Postgres multiple times?

 Sent from Windows Mail

 *From:* ayan guha guha.a...@gmail.com
 *Sent:* ‎Monday‎, ‎July‎ ‎27‎, ‎2015 ‎4‎:‎41‎ ‎PM
 *To:* Jeetendra Gangele gangele...@gmail.com
 *Cc:* felixcheun...@hotmail.com, user@spark.apache.org

 You can call dB connect once per partition. Please have a look at
 design patterns of for each construct in document.
 How big is your data in dB? How soon that data changes? You would be
 better off if data is in spark already
 On 28 Jul 2015 04:48, Jeetendra Gangele gangele...@gmail.com wrote:

 Thanks for your reply.

 Parallel i will be hitting around 6000 call to postgreSQl which is not
 good my database will die.
 these calls to database will keeps on increasing.
 Handling millions on request is not an issue with Hbase/NOSQL

 any other alternative?




 On 27 July 2015 at 23:18, felixcheun...@hotmail.com wrote:

 You can have Spark reading from PostgreSQL through the data access
 API. Do you have any concern with that approach since you mention copying
 that data into HBase.

 From: Jeetendra Gangele
 Sent: Monday, July 27, 6:00 AM
 Subject: Data from PostgreSQL to Spark
 To: user

 Hi All

 I have a use case where where I am consuming the Events from RabbitMQ
 using spark streaming.This event has some fields on which I want to query
 the PostgreSQL and bring the data and then do the join between event data
 and PostgreSQl data and put the aggregated data into HDFS, so that I run
 run analytics query over this data using SparkSQL.

 my question is PostgreSQL data in production data so i don't want to
 hit so many times.

 at any given  1 seconds time I may have 3000 events,that means I need
 to fire 3000 parallel query to my PostGreSQl and this data keeps on
 growing, so my database will go down.



 I can't migrate this PostgreSQL data since lots of system using
 it,but I can take this data to some NOSQL like base and query the Hbase,
 but here issue is How can I make sure that Hbase has upto date data?

 Any anyone suggest me best approach/ method to handle this case?

 Regards

 Jeetendra




Re: Spark on YARN

2015-07-30 Thread Jeetendra Gangele
Thanks for the information; this fixed the issue. The issue was with the Spark
master memory: when I specified 1G for the master manually, it started working.
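
In yarn-cluster mode the "master" here is the driver running inside the
ApplicationMaster, so its memory can be set explicitly at submit time. A sketch
reusing the SparkPi command from this thread (the memory values are examples
only):

bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --driver-memory 1g \
  --executor-memory 1g \
  lib/spark-examples-1.4.1-hadoop2.6.0.jar 10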

On 30 July 2015 at 14:26, Shao, Saisai saisai.s...@intel.com wrote:

  You’d better also check the log of the NodeManager; sometimes this happens
  because your memory usage exceeds the limit in the YARN container configuration.



 I’ve met similar problem before, here is the warning log in nodemanager:



 2015-07-07 17:06:07,141 WARN
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=17385,containerID=container_1436259427993_0001_02_01] is
 running beyond virtual memory limits. Current usage: 318.1 MB of 1 GB
 physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing
 container.



  The default pmem-vmem ratio is 2.1, but it seems the executor requires more
  vmem when started, so the NodeManager will kill it. If you hit a similar
  problem, you could increase the configuration “yarn.nodemanager.vmem-pmem-ratio”.



 Thanks

 Jerry
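
If the NodeManager kill is indeed the virtual-memory check, the corresponding
YARN-side settings look roughly like the sketch below (values are assumptions,
and relaxing or disabling the check affects the whole cluster):

<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>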



 *From:* Jeff Zhang [mailto:zjf...@gmail.com]
 *Sent:* Thursday, July 30, 2015 4:36 PM
 *To:* Jeetendra Gangele
 *Cc:* user
 *Subject:* Re: Spark on YARN



  15/07/30 12:13:35 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15:
 SIGTERM



 AM is killed somehow, may due to preemption. Does it always happen ?
 Resource manager log would be helpful.







 On Thu, Jul 30, 2015 at 4:17 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

  I can't see the application logs here. All the logs are going into
 stderr. can anybody help here?



 On 30 July 2015 at 12:21, Jeetendra Gangele gangele...@gmail.com wrote:

  I am running below command this is default spark PI program but this is
 not running all the log are going in stderr but at the terminal job is
 succeeding .I guess there are con issue job it not at all launching



 /bin/spark-submit --class org.apache.spark.examples.SparkPi --master
 yarn-cluster lib/spark-examples-1.4.1-hadoop2.6.0.jar 10





 Complete log



 SLF4J: Class path contains multiple SLF4J bindings.

 SLF4J: Found binding in 
 [jar:file:/home/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/23/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

 SLF4J: Found binding in 
 [jar:file:/opt/hadoop-2.7.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
 explanation.

 SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

 15/07/30 12:13:31 INFO yarn.ApplicationMaster: Registered signal handlers for 
 [TERM, HUP, INT]

 15/07/30 12:13:32 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
 appattempt_1438090734187_0010_01

 15/07/30 12:13:33 INFO spark.SecurityManager: Changing view acls to: hadoop

 15/07/30 12:13:33 INFO spark.SecurityManager: Changing modify acls to: hadoop

 15/07/30 12:13:33 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
 with modify permissions: Set(hadoop)

 15/07/30 12:13:33 INFO yarn.ApplicationMaster: Starting the user application 
 in a separate Thread

 15/07/30 12:13:33 INFO yarn.ApplicationMaster: Waiting for spark context 
 initialization

 15/07/30 12:13:33 INFO yarn.ApplicationMaster: Waiting for spark context 
 initialization ...

 15/07/30 12:13:33 INFO spark.SparkContext: Running Spark version 1.4.1

 15/07/30 12:13:33 WARN spark.SparkConf:

 SPARK_JAVA_OPTS was detected (set to '-Dspark.driver.port=53411').

 This is deprecated in Spark 1.0+.



 Please instead use:

  - ./spark-submit with conf/spark-defaults.conf to set defaults for an 
 application

  - ./spark-submit with --driver-java-options to set -X options for a driver

  - spark.executor.extraJavaOptions to set -X options for executors

  - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master 
 or worker)



 15/07/30 12:13:33 WARN spark.SparkConf: Setting 
 'spark.executor.extraJavaOptions' to '-Dspark.driver.port=53411' as a 
 work-around.

 15/07/30 12:13:33 WARN spark.SparkConf: Setting 
 'spark.driver.extraJavaOptions' to '-Dspark.driver.port=53411' as a 
 work-around.

 15/07/30 12:13:33 INFO spark.SecurityManager: Changing view acls to: hadoop

 15/07/30 12:13:33 INFO spark.SecurityManager: Changing modify acls to: hadoop

 15/07/30 12:13:33 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
 with modify permissions: Set(hadoop)

 15/07/30 12:13:33 INFO slf4j.Slf4jLogger: Slf4jLogger started

 15/07/30 12:13:33 INFO Remoting: Starting remoting

 15/07/30 12:13:34 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkDriver@10.21.1.77:53411]

 15/07/30 12:13:34 INFO util.Utils: Successfully started service 'sparkDriver' 
 on port 53411.

 15/07/30 12:13:34 INFO spark.SparkEnv: Registering MapOutputTracker

Re: Spark on YARN

2015-07-30 Thread Jeetendra Gangele
I can't see the application logs here. All the logs are going into stderr.
can anybody help here?

On 30 July 2015 at 12:21, Jeetendra Gangele gangele...@gmail.com wrote:

 I am running below command this is default spark PI program but this is
 not running all the log are going in stderr but at the terminal job is
 succeeding .I guess there are con issue job it not at all launching

 /bin/spark-submit --class org.apache.spark.examples.SparkPi --master
 yarn-cluster lib/spark-examples-1.4.1-hadoop2.6.0.jar 10


 Complete log

 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in 
 [jar:file:/home/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/23/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in 
 [jar:file:/opt/hadoop-2.7.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
 explanation.
 SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
 15/07/30 12:13:31 INFO yarn.ApplicationMaster: Registered signal handlers for 
 [TERM, HUP, INT]
 15/07/30 12:13:32 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
 appattempt_1438090734187_0010_01
 15/07/30 12:13:33 INFO spark.SecurityManager: Changing view acls to: hadoop
 15/07/30 12:13:33 INFO spark.SecurityManager: Changing modify acls to: hadoop
 15/07/30 12:13:33 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
 with modify permissions: Set(hadoop)
 15/07/30 12:13:33 INFO yarn.ApplicationMaster: Starting the user application 
 in a separate Thread
 15/07/30 12:13:33 INFO yarn.ApplicationMaster: Waiting for spark context 
 initialization
 15/07/30 12:13:33 INFO yarn.ApplicationMaster: Waiting for spark context 
 initialization ...
 15/07/30 12:13:33 INFO spark.SparkContext: Running Spark version 1.4.1
 15/07/30 12:13:33 WARN spark.SparkConf:
 SPARK_JAVA_OPTS was detected (set to '-Dspark.driver.port=53411').
 This is deprecated in Spark 1.0+.

 Please instead use:
  - ./spark-submit with conf/spark-defaults.conf to set defaults for an 
 application
  - ./spark-submit with --driver-java-options to set -X options for a driver
  - spark.executor.extraJavaOptions to set -X options for executors
  - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master 
 or worker)

 15/07/30 12:13:33 WARN spark.SparkConf: Setting 
 'spark.executor.extraJavaOptions' to '-Dspark.driver.port=53411' as a 
 work-around.
 15/07/30 12:13:33 WARN spark.SparkConf: Setting 
 'spark.driver.extraJavaOptions' to '-Dspark.driver.port=53411' as a 
 work-around.
 15/07/30 12:13:33 INFO spark.SecurityManager: Changing view acls to: hadoop
 15/07/30 12:13:33 INFO spark.SecurityManager: Changing modify acls to: hadoop
 15/07/30 12:13:33 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
 with modify permissions: Set(hadoop)
 15/07/30 12:13:33 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/07/30 12:13:33 INFO Remoting: Starting remoting
 15/07/30 12:13:34 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkDriver@10.21.1.77:53411]
 15/07/30 12:13:34 INFO util.Utils: Successfully started service 'sparkDriver' 
 on port 53411.
 15/07/30 12:13:34 INFO spark.SparkEnv: Registering MapOutputTracker
 15/07/30 12:13:34 INFO spark.SparkEnv: Registering BlockManagerMaster
 15/07/30 12:13:34 INFO storage.DiskBlockManager: Created local directory at 
 /home/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1438090734187_0010/blockmgr-2166bbd9-b1ed-41d1-bc95-92c6a7fbd36f
 15/07/30 12:13:34 INFO storage.MemoryStore: MemoryStore started with capacity 
 246.0 MB
 15/07/30 12:13:34 INFO spark.HttpFileServer: HTTP File server directory is 
 /home/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1438090734187_0010/httpd-d1232310-5aa1-44e7-a99a-cc2ae614f89c
 15/07/30 12:13:34 INFO spark.HttpServer: Starting HTTP Server
 15/07/30 12:13:34 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/07/30 12:13:34 INFO server.AbstractConnector: Started 
 SocketConnector@0.0.0.0:52507
 15/07/30 12:13:34 INFO util.Utils: Successfully started service 'HTTP file 
 server' on port 52507.
 15/07/30 12:13:34 INFO spark.SparkEnv: Registering OutputCommitCoordinator
 15/07/30 12:13:34 INFO ui.JettyUtils: Adding filter: 
 org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
 15/07/30 12:13:34 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/07/30 12:13:34 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:59596
 15/07/30 12:13:34 INFO util.Utils: Successfully started service 'SparkUI' on 
 port 59596.
 15/07/30 12:13:34 INFO ui.SparkUI: Started SparkUI at http://10.21.1.77:59596
 15/07/30 12:13:34 INFO cluster.YarnClusterScheduler: Created 
 YarnClusterScheduler
 15/07/30

Spark on YARN

2015-07-30 Thread Jeetendra Gangele
I am running the command below; this is the default SparkPi program, but it is
not running. All the logs are going to stderr, yet at the terminal the job is
reported as succeeding. I guess there is some configuration issue and the job
is not launching at all.

/bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-cluster lib/spark-examples-1.4.1-hadoop2.6.0.jar 10


Complete log

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/home/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/23/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/opt/hadoop-2.7.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/07/30 12:13:31 INFO yarn.ApplicationMaster: Registered signal
handlers for [TERM, HUP, INT]
15/07/30 12:13:32 INFO yarn.ApplicationMaster: ApplicationAttemptId:
appattempt_1438090734187_0010_01
15/07/30 12:13:33 INFO spark.SecurityManager: Changing view acls to: hadoop
15/07/30 12:13:33 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/07/30 12:13:33 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view
permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/07/30 12:13:33 INFO yarn.ApplicationMaster: Starting the user
application in a separate Thread
15/07/30 12:13:33 INFO yarn.ApplicationMaster: Waiting for spark
context initialization
15/07/30 12:13:33 INFO yarn.ApplicationMaster: Waiting for spark
context initialization ...
15/07/30 12:13:33 INFO spark.SparkContext: Running Spark version 1.4.1
15/07/30 12:13:33 WARN spark.SparkConf:
SPARK_JAVA_OPTS was detected (set to '-Dspark.driver.port=53411').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with conf/spark-defaults.conf to set defaults for an
application
 - ./spark-submit with --driver-java-options to set -X options for a driver
 - spark.executor.extraJavaOptions to set -X options for executors
 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons
(master or worker)

15/07/30 12:13:33 WARN spark.SparkConf: Setting
'spark.executor.extraJavaOptions' to '-Dspark.driver.port=53411' as a
work-around.
15/07/30 12:13:33 WARN spark.SparkConf: Setting
'spark.driver.extraJavaOptions' to '-Dspark.driver.port=53411' as a
work-around.
15/07/30 12:13:33 INFO spark.SecurityManager: Changing view acls to: hadoop
15/07/30 12:13:33 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/07/30 12:13:33 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view
permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/07/30 12:13:33 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/07/30 12:13:33 INFO Remoting: Starting remoting
15/07/30 12:13:34 INFO Remoting: Remoting started; listening on
addresses :[akka.tcp://sparkDriver@10.21.1.77:53411]
15/07/30 12:13:34 INFO util.Utils: Successfully started service
'sparkDriver' on port 53411.
15/07/30 12:13:34 INFO spark.SparkEnv: Registering MapOutputTracker
15/07/30 12:13:34 INFO spark.SparkEnv: Registering BlockManagerMaster
15/07/30 12:13:34 INFO storage.DiskBlockManager: Created local
directory at 
/home/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1438090734187_0010/blockmgr-2166bbd9-b1ed-41d1-bc95-92c6a7fbd36f
15/07/30 12:13:34 INFO storage.MemoryStore: MemoryStore started with
capacity 246.0 MB
15/07/30 12:13:34 INFO spark.HttpFileServer: HTTP File server
directory is 
/home/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1438090734187_0010/httpd-d1232310-5aa1-44e7-a99a-cc2ae614f89c
15/07/30 12:13:34 INFO spark.HttpServer: Starting HTTP Server
15/07/30 12:13:34 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/30 12:13:34 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:52507
15/07/30 12:13:34 INFO util.Utils: Successfully started service 'HTTP
file server' on port 52507.
15/07/30 12:13:34 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/07/30 12:13:34 INFO ui.JettyUtils: Adding filter:
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/07/30 12:13:34 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/30 12:13:34 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:59596
15/07/30 12:13:34 INFO util.Utils: Successfully started service
'SparkUI' on port 59596.
15/07/30 12:13:34 INFO ui.SparkUI: Started SparkUI at http://10.21.1.77:59596
15/07/30 12:13:34 INFO cluster.YarnClusterScheduler: Created
YarnClusterScheduler
15/07/30 12:13:34 INFO util.Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port
43354.
15/07/30 12:13:34 INFO netty.NettyBlockTransferService: Server created on 43354
15/07/30 12:13:34 INFO storage.BlockManagerMaster: Trying to register

Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
Hi Ayan, thanks for the reply.
It is around 5 GB across 10 tables. This data changes very frequently: every
minute there are a few updates.
It is difficult to keep this data in Spark; if any updates happen on the main
tables, how can I refresh the data in Spark?





On 28 July 2015 at 02:11, ayan guha guha.a...@gmail.com wrote:

 You can call DB connect once per partition. Please have a look at the design
 pattern for the foreach construct in the documentation.
 How big is your data in the DB? How soon does that data change? You would be
 better off if the data were in Spark already.
 On 28 Jul 2015 04:48, Jeetendra Gangele gangele...@gmail.com wrote:

 Thanks for your reply.

 Parallel i will be hitting around 6000 call to postgreSQl which is not
 good my database will die.
 these calls to database will keeps on increasing.
 Handling millions on request is not an issue with Hbase/NOSQL

 any other alternative?




 On 27 July 2015 at 23:18, felixcheun...@hotmail.com wrote:

 You can have Spark reading from PostgreSQL through the data access API.
 Do you have any concern with that approach since you mention copying that
 data into HBase.

 From: Jeetendra Gangele
 Sent: Monday, July 27, 6:00 AM
 Subject: Data from PostgreSQL to Spark
 To: user

 Hi All

 I have a use case where where I am consuming the Events from RabbitMQ
 using spark streaming.This event has some fields on which I want to query
 the PostgreSQL and bring the data and then do the join between event data
 and PostgreSQl data and put the aggregated data into HDFS, so that I run
 run analytics query over this data using SparkSQL.

 my question is PostgreSQL data in production data so i don't want to hit
 so many times.

 at any given  1 seconds time I may have 3000 events,that means I need to
 fire 3000 parallel query to my PostGreSQl and this data keeps on growing,
 so my database will go down.



 I can't migrate this PostgreSQL data since lots of system using it,but I
 can take this data to some NOSQL like base and query the Hbase, but here
 issue is How can I make sure that Hbase has upto date data?

 Any anyone suggest me best approach/ method to handle this case?

 Regards

 Jeetendra




Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
I am trying to do that, but there will always be a data mismatch, since by the
time Sqoop is fetching, the main database will have received many updates.
There is something called incremental data fetch in Sqoop, but that hits the
database rather than reading the WAL edits.
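
For completeness, a sketch of the incremental fetch mentioned above (connection
string, table and column names are hypothetical); note that it still queries
the source database, it does not read the WAL:

sqoop import \
  --connect jdbc:postgresql://pg-host:5432/proddb \
  --table events \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2015-07-28 00:00:00" \
  --target-dir /data/events_delta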



On 28 July 2015 at 02:52, santosh...@gmail.com wrote:

  Why cant you bulk pre-fetch the data to HDFS (like using Sqoop) instead
 of hitting Postgres multiple times?

 Sent from Windows Mail

 *From:* ayan guha guha.a...@gmail.com
 *Sent:* ‎Monday‎, ‎July‎ ‎27‎, ‎2015 ‎4‎:‎41‎ ‎PM
 *To:* Jeetendra Gangele gangele...@gmail.com
 *Cc:* felixcheun...@hotmail.com, user@spark.apache.org

 You can call dB connect once per partition. Please have a look at design
 patterns of for each construct in document.
 How big is your data in dB? How soon that data changes? You would be
 better off if data is in spark already
 On 28 Jul 2015 04:48, Jeetendra Gangele gangele...@gmail.com wrote:

 Thanks for your reply.

 Parallel i will be hitting around 6000 call to postgreSQl which is not
 good my database will die.
 these calls to database will keeps on increasing.
 Handling millions on request is not an issue with Hbase/NOSQL

 any other alternative?




 On 27 July 2015 at 23:18, felixcheun...@hotmail.com wrote:

 You can have Spark reading from PostgreSQL through the data access API.
 Do you have any concern with that approach since you mention copying that
 data into HBase.

 From: Jeetendra Gangele
 Sent: Monday, July 27, 6:00 AM
 Subject: Data from PostgreSQL to Spark
 To: user

 Hi All

 I have a use case where where I am consuming the Events from RabbitMQ
 using spark streaming.This event has some fields on which I want to query
 the PostgreSQL and bring the data and then do the join between event data
 and PostgreSQl data and put the aggregated data into HDFS, so that I run
 run analytics query over this data using SparkSQL.

 my question is PostgreSQL data in production data so i don't want to hit
 so many times.

 at any given  1 seconds time I may have 3000 events,that means I need to
 fire 3000 parallel query to my PostGreSQl and this data keeps on growing,
 so my database will go down.



 I can't migrate this PostgreSQL data since lots of system using it,but I
 can take this data to some NOSQL like base and query the Hbase, but here
 issue is How can I make sure that Hbase has upto date data?

 Any anyone suggest me best approach/ method to handle this case?

 Regards

 Jeetendra




Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
"Can the source write to Kafka/Flume/HBase in addition to Postgres?" No,
it can't. This is due to the fact that there are many applications
producing this PostgreSQL data, and I can't really ask all the teams
to start writing to some other source.


The velocity of the application is too high.






On 28 July 2015 at 21:50, santosh...@gmail.com wrote:

  Sqoop’s incremental data fetch will reduce the data size you need to
 pull from source, but then by the time that incremental data fetch is
 complete, is it not current again, if velocity of the data is high?

 May be you can put a trigger in Postgres to send data to the big data
 cluster as soon as changes are made. Or as I was saying in another email,
 can the source write to Kafka/Flume/Hbase in addition to Postgres?

 Sent from Windows Mail

 *From:* Jeetendra Gangele gangele...@gmail.com
 *Sent:* ‎Tuesday‎, ‎July‎ ‎28‎, ‎2015 ‎5‎:‎43‎ ‎AM
 *To:* santosh...@gmail.com
 *Cc:* ayan guha guha.a...@gmail.com, felixcheun...@hotmail.com,
 user@spark.apache.org

 I trying do that, but there will always data mismatch, since by the time
 scoop is fetching main database will get so many updates. There is
 something called incremental data fetch using scoop but that hits a
 database rather than reading the WAL edit.



 On 28 July 2015 at 02:52, santosh...@gmail.com wrote:

  Why cant you bulk pre-fetch the data to HDFS (like using Sqoop) instead
 of hitting Postgres multiple times?

 Sent from Windows Mail

 *From:* ayan guha guha.a...@gmail.com
 *Sent:* ‎Monday‎, ‎July‎ ‎27‎, ‎2015 ‎4‎:‎41‎ ‎PM
 *To:* Jeetendra Gangele gangele...@gmail.com
 *Cc:* felixcheun...@hotmail.com, user@spark.apache.org

 You can call dB connect once per partition. Please have a look at design
 patterns of for each construct in document.
 How big is your data in dB? How soon that data changes? You would be
 better off if data is in spark already
 On 28 Jul 2015 04:48, Jeetendra Gangele gangele...@gmail.com wrote:

 Thanks for your reply.

 Parallel i will be hitting around 6000 call to postgreSQl which is not
 good my database will die.
 these calls to database will keeps on increasing.
 Handling millions on request is not an issue with Hbase/NOSQL

 any other alternative?




 On 27 July 2015 at 23:18, felixcheun...@hotmail.com wrote:

 You can have Spark reading from PostgreSQL through the data access API.
 Do you have any concern with that approach since you mention copying that
 data into HBase.

 From: Jeetendra Gangele
 Sent: Monday, July 27, 6:00 AM
 Subject: Data from PostgreSQL to Spark
 To: user

 Hi All

 I have a use case where where I am consuming the Events from RabbitMQ
 using spark streaming.This event has some fields on which I want to query
 the PostgreSQL and bring the data and then do the join between event data
 and PostgreSQl data and put the aggregated data into HDFS, so that I run
 run analytics query over this data using SparkSQL.

 my question is PostgreSQL data in production data so i don't want to
 hit so many times.

 at any given  1 seconds time I may have 3000 events,that means I need
 to fire 3000 parallel query to my PostGreSQl and this data keeps on
 growing, so my database will go down.



 I can't migrate this PostgreSQL data since lots of system using it,but
 I can take this data to some NOSQL like base and query the Hbase, but here
 issue is How can I make sure that Hbase has upto date data?

 Any anyone suggest me best approach/ method to handle this case?

 Regards

 Jeetendra








Data from PostgreSQL to Spark

2015-07-27 Thread Jeetendra Gangele
Hi All

I have a use case where I am consuming events from RabbitMQ using Spark
Streaming. Each event has some fields on which I want to query PostgreSQL,
bring back the data, do the join between the event data and the PostgreSQL
data, and put the aggregated data into HDFS, so that I can run analytics
queries over this data using Spark SQL.

My question: the PostgreSQL data is production data, so I don't want to hit it
so many times.

At any given second I may have 3000 events, which means I would need to fire
3000 parallel queries against PostgreSQL, and this keeps growing, so my
database will go down.

I can't migrate this PostgreSQL data since lots of systems use it, but I can
copy this data to some NoSQL store like HBase and query HBase; the issue there
is, how can I make sure that HBase has up-to-date data?

Can anyone suggest the best approach/method to handle this case?
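
A minimal sketch of the "data access API" approach suggested in the replies,
i.e. reading the PostgreSQL reference data through Spark's JDBC data source
(URL, table and credentials are hypothetical, and the PostgreSQL JDBC driver
jar must be on the classpath):

val lookup = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:postgresql://pg-host:5432/proddb",
  "dbtable"  -> "public.reference_data",
  "user"     -> "spark",
  "password" -> "secret"
)).load()

// expose it to Spark SQL so it can be joined with the streamed event data
lookup.registerTempTable("reference_data")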


Regards
Jeetendra


Re: Data from PostgreSQL to Spark

2015-07-27 Thread Jeetendra Gangele
Thanks for your reply.

In parallel I will be hitting PostgreSQL with around 6000 calls, which is not
good; my database will die.
These calls to the database will keep increasing.
Handling millions of requests is not an issue with HBase/NoSQL.

Any other alternative?
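
A sketch of the per-partition connection pattern suggested elsewhere in these
threads, so the 3000 events per second share a handful of PostgreSQL
connections instead of opening one each; the event shape, table and query are
hypothetical:

import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

case class Event(referenceId: Long, payload: String)   // hypothetical event shape

def enrich(events: DStream[Event]): Unit =
  events.foreachRDD { rdd =>
    rdd.foreachPartition { part =>
      // one connection per partition rather than one per event
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://pg-host:5432/proddb", "spark", "secret")
      val stmt = conn.prepareStatement("select name from reference_data where id = ?")
      part.foreach { e =>
        stmt.setLong(1, e.referenceId)
        val rs = stmt.executeQuery()
        // ... build the enriched record from e and rs here ...
      }
      conn.close()
    }
  }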




On 27 July 2015 at 23:18, felixcheun...@hotmail.com wrote:

 You can have Spark reading from PostgreSQL through the data access API. Do
 you have any concern with that approach since you mention copying that data
 into HBase.

 From: Jeetendra Gangele
 Sent: Monday, July 27, 6:00 AM
 Subject: Data from PostgreSQL to Spark
 To: user

 Hi All

 I have a use case where where I am consuming the Events from RabbitMQ
 using spark streaming.This event has some fields on which I want to query
 the PostgreSQL and bring the data and then do the join between event data
 and PostgreSQl data and put the aggregated data into HDFS, so that I run
 run analytics query over this data using SparkSQL.

 my question is PostgreSQL data in production data so i don't want to hit
 so many times.

 at any given  1 seconds time I may have 3000 events,that means I need to
 fire 3000 parallel query to my PostGreSQl and this data keeps on growing,
 so my database will go down.



 I can't migrate this PostgreSQL data since lots of system using it,but I
 can take this data to some NOSQL like base and query the Hbase, but here
 issue is How can I make sure that Hbase has upto date data?

 Any anyone suggest me best approach/ method to handle this case?

 Regards

 Jeetendra




getting Error while Running SparkPi program

2015-07-24 Thread Jeetendra Gangele
While running the command below I am getting the following error in the YARN log. Has anybody hit this issue?

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-cluster lib/spark-examples-1.4.1-hadoop2.6.0.jar 10


2015-07-24 12:06:10,846 ERROR [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Error
communicating with RM: Resource Manager doesn't recognize AttemptId:
appattempt_1437676824552_0002_01
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Resource
Manager doesn't recognize AttemptId:
appattempt_1437676824552_0002_01


Re: Re: Need help in setting up spark cluster

2015-07-23 Thread Jeetendra Gangele
Thanks for the reply and your valuable suggestions.

I have 10 GB of data generated every day that I need to write into my
database. This data is schema-based but the schema changes frequently, so
consider it unstructured. Sometimes I may have to serve 1
write/secs with 4 m1.xlarge machines, so will Spark SQL with the Hive Thrift
Server be good enough?
As per my understanding Spark SQL works on SchemaRDDs; will there not be any
problem when the schema changes?

Also, I have complex queries for real-time analytics, something like AND
queries involving multiple fields, for example:

list all users who bought flats in Mumbai in the last 30 minutes

If I use HBase/Cassandra I need to set up a NoSQL cluster, so then there are two
clusters, one for Spark and another for NoSQL; so isn't it better to start
with HDP?
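
As a rough illustration of the kind of query Spark SQL could serve over the stored
events (the table name, column names and epoch-second timestamp column are all
hypothetical, and the events are assumed to be registered as a table):

val thirtyMinutesAgo = System.currentTimeMillis() / 1000 - 30 * 60
val recentBuyers = sqlContext.sql(
  s"""SELECT user_id
      FROM user_events
      WHERE event_type = 'purchase'
        AND property_type = 'flat'
        AND city = 'Mumbai'
        AND event_ts >= $thirtyMinutesAgo""")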




On 23 July 2015 at 11:33, fightf...@163.com fightf...@163.com wrote:

 Hi, there

 For your analytical and real-time recommendation requirements, I would
 recommend you use Spark SQL and the Hive Thrift Server to store and process
 your Spark Streaming data. The Thrift Server runs as a long-lived application,
 so it is quite feasible to consume data periodically and serve such analytical
 requirements.

 On the other hand, HBase or Cassandra would also be sufficient, and I think
 you may want to integrate Spark SQL with HBase / Cassandra for your data
 ingestion. You could deploy a CDH or HDP platform to support your production
 environment. I suggest you first deploy a Spark standalone cluster to run some
 integration tests; you can also consider running Spark on YARN for later
 development use cases.

 Best,
 Sun.

 --
 fightf...@163.com


 *From:* Jeetendra Gangele gangele...@gmail.com
 *Date:* 2015-07-23 13:39
 *To:* user user@spark.apache.org
 *Subject:* Re: Need help in setting up spark cluster
 Can anybody help here?

 On 22 July 2015 at 10:38, Jeetendra Gangele gangele...@gmail.com wrote:

 Hi All,

 I am trying to capture user activities for a real estate portal.

 I am using a RabbitMQ and Spark Streaming combination, where I push all the
 events to RabbitMQ and then consume them with 1-second micro-batches using
 Spark Streaming.

 Later on I am thinking of storing the consumed data for analytics or
 near-real-time recommendations.

 Should I store this data in Spark RDDs themselves, so that people can query it
 for analytics or real-time recommendations using Spark SQL? The data is not
 huge, currently about 10 GB per day.

 Another alternative would be either HBase or Cassandra; which one would be
 better?

 Any suggestions?

 Also, for this use case should I use an existing big data platform like
 Hortonworks, or can I deploy a standalone Spark cluster?







Need help in SparkSQL

2015-07-22 Thread Jeetendra Gangele
Hi All,

I have data in MongoDB (a few TBs) which I want to migrate to HDFS to do
complex query analysis on it: queries like AND queries involving
multiple fields.

So my question is: in which format should I store the data in HDFS so
that processing will be fast for such kinds of queries?


Regards
Jeetendra
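
A columnar format such as Parquet (or ORC, as suggested elsewhere in this thread)
generally suits such multi-field AND queries, since only the referenced columns are
read and simple predicates can be pushed down. A minimal sketch, with hypothetical
paths and column names, assuming the Spark 1.4+ DataFrame API:

// one-off conversion of the exported data to Parquet
val docs = sqlContext.read.json("hdfs:///mongo_export/listings.json")
docs.write.parquet("hdfs:///warehouse/listings_parquet")

// multi-field AND query over the columnar data
val hits = sqlContext.read.parquet("hdfs:///warehouse/listings_parquet")
  .filter("city = 'Mumbai' AND bhk = 1 AND price < 5000000")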


Re: Need help in setting up spark cluster

2015-07-22 Thread Jeetendra Gangele
Can anybody help here?

On 22 July 2015 at 10:38, Jeetendra Gangele gangele...@gmail.com wrote:

 Hi All,

 I am trying to capture user activities for a real estate portal.

 I am using a RabbitMQ and Spark Streaming combination, where I push all the
 events to RabbitMQ and then consume them with 1-second micro-batches using
 Spark Streaming.

 Later on I am thinking of storing the consumed data for analytics or
 near-real-time recommendations.

 Should I store this data in Spark RDDs themselves, so that people can query it
 for analytics or real-time recommendations using Spark SQL? The data is not
 huge, currently about 10 GB per day.

 Another alternative would be either HBase or Cassandra; which one would be
 better?

 Any suggestions?

 Also, for this use case should I use an existing big data platform like
 Hortonworks, or can I deploy a standalone Spark cluster?



Re: Need help in SparkSQL

2015-07-22 Thread Jeetendra Gangele
The queries will be something like this:

1. how many users visited 1 BHK flats in the last 1 hour in a given area
2. how many visitors there were for flats in a given area
3. list all users who bought a given property in the last 30 days

Further, it may get more complex, involving multiple parameters in a query.

The problem with HBase is designing a row key that serves such queries efficiently.

Since I have multiple fields to query on, HBase may not be a good choice?

I don't want to iterate over the result set which HBase returns and compute the
result from it, because this will kill performance.

On 23 July 2015 at 01:02, Jörn Franke jornfra...@gmail.com wrote:

 Can you provide an example of an and query ? If you do just look-up you
 should try Hbase/ phoenix, otherwise you can try orc with storage index
 and/or compression, but this depends on how your queries look like

 Le mer. 22 juil. 2015 à 14:48, Jeetendra Gangele gangele...@gmail.com a
 écrit :

 HI All,

 I have data in MongoDb(few TBs) which I want to migrate to HDFS to do
 complex queries analysis on this data.Queries like AND queries involved
 multiple fields

 So my question in which which format I should store the data in HDFS so
 that processing will be fast for such kind of queries?


 Regards
 Jeetendra






Does Spark streaming support is there with RabbitMQ

2015-07-20 Thread Jeetendra Gangele
Does Apache Spark support RabbitMQ? I have messages in RabbitMQ and I want
to process them using Apache Spark Streaming. Does it scale?

Regards
Jeetendra


Re: Does Spark streaming support is there with RabbitMQ

2015-07-20 Thread Jeetendra Gangele
Thanks, Todd.

I'm not sure whether anybody has used it or not. Can somebody confirm whether
it integrates nicely with Spark Streaming?
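
In case a packaged receiver does not fit, a custom receiver is a workable fallback.
A rough sketch using Spark's Receiver API together with the plain RabbitMQ Java
client (host and queue names are hypothetical; reconnection and error handling are
omitted):

import com.rabbitmq.client.{ConnectionFactory, QueueingConsumer}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class RabbitMQReceiver(host: String, queue: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("RabbitMQ Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {}   // the consuming thread exits once isStopped() is true

  private def receive(): Unit = {
    val factory = new ConnectionFactory()
    factory.setHost(host)
    val connection = factory.newConnection()
    val channel = connection.createChannel()
    val consumer = new QueueingConsumer(channel)
    channel.basicConsume(queue, true, consumer)
    while (!isStopped()) {
      val delivery = consumer.nextDelivery()
      store(new String(delivery.getBody, "UTF-8"))   // hand the message to Spark
    }
    channel.close()
    connection.close()
  }
}

// usage: val messages = ssc.receiverStream(new RabbitMQReceiver("mq-host", "events"))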


On 20 July 2015 at 18:43, Todd Nist tsind...@gmail.com wrote:

 There is one package available on the spark-packages site,

 http://spark-packages.org/package/Stratio/RabbitMQ-Receiver

 The source is here:

 https://github.com/Stratio/RabbitMQ-Receiver

 Not sure that meets your needs or not.

 -Todd

 On Mon, Jul 20, 2015 at 8:52 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Does Apache spark support RabbitMQ. I have messages on RabbitMQ and I
 want to process them using Apache Spark streaming does it scale?

 Regards
 Jeetendra





Re: Create multiple rows from elements in array on a single row

2015-06-08 Thread Jeetendra Gangele
You can use flatMap (flatMapToPair in the Java API) to break up the key sequence, emitting one (key, value) pair per key.
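
A minimal Scala sketch of the de-flattening (the Java equivalent is flatMapToPair):

// rdd: RDD[(Seq[String], String)] -- (keys, value)
val deflattened = rdd.flatMap { case (keys, value) =>
  keys.map(key => (key, value))   // one output row per key in the sequence
}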

On 8 June 2015 at 23:27, Bill Q bill.q@gmail.com wrote:

 Hi,
 I have a rdd with the following structure:
 row1: key: Seq[a, b]; value: value 1
 row2: key: seq[a, c, f]; value: value 2

 Is there an efficient way to de-flat the rows into?
 row1: key: a; value: value1
 row2: key: a; value: value2
 row3: key: b; value: value1
 row4: key: c; value: value2
 row5: key: f; value: value2


 Many thanks.


 Bill



Re: Error in using saveAsParquetFile

2015-06-08 Thread Jeetendra Gangele
When are you loading these Parquet files?
Can you please share the code where you pass the Parquet files to Spark?
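
For reference, if the column types do conflict as described in the reply below, a
minimal sketch of aligning them before appending (the column name comes from the
error message; the path and the Spark 1.4+ writer API are assumptions):

import org.apache.spark.sql.functions.col

val aligned = joined.withColumn("PolicyType", col("PolicyType").cast("int"))
aligned.write.mode("append").parquet("/path/to/existing/parquet")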

On 8 June 2015 at 16:39, Cheng Lian lian.cs@gmail.com wrote:

 Are you appending the joined DataFrame whose PolicyType is string to an
 existing Parquet file whose PolicyType is int? The exception indicates that
 Parquet found a column with conflicting data types.

 Cheng


 On 6/8/15 5:29 PM, bipin wrote:

 Hi I get this error message when saving a table:

 parquet.io.ParquetDecodingException: The requested schema is not
 compatible
 with the file schema. incompatible types: optional binary PolicyType
 (UTF8)
 != optional int32 PolicyType
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:97)
 at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:87)
 at

 parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:61)
 at parquet.schema.MessageType.accept(MessageType.java:55)
 at
 parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
 at
 parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:137)
 at
 parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:157)
 at

 parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
 at

 parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:94)
 at
 parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64)
 at

 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
 at

 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
 at
 org.apache.spark.sql.parquet.ParquetRelation2.org
 $apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 I joined two tables both loaded from parquet file, the joined table when
 saved throws this error. I could not find anything about this error. Could
 this be a bug ?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Error-in-using-saveAsParquetFile-tp23204.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: path to hdfs

2015-06-08 Thread Jeetendra Gangele
The HDFS path you are passing to the Spark job is incorrect; the output path needs the hdfs:// scheme rather than file:/.
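
For reference, a fully qualified HDFS output URI includes the scheme plus the
namenode host and port before the path (host, port and path below are taken from
the error message):

val output = "hdfs://127.0.0.1:8020/user/cloudera/outputs/output_spark"
results.saveAsTextFile(output)   // results stands for whatever RDD is being written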

On 8 June 2015 at 16:24, Nirmal Fernando nir...@wso2.com wrote:

 HDFS path should be something like; hdfs://
 127.0.0.1:8020/user/cloudera/inputs/

 On Mon, Jun 8, 2015 at 4:15 PM, Pa Rö paul.roewer1...@googlemail.com
 wrote:

 hello,

 i submit my spark job with the following parameters:

 ./spark-1.1.0-bin-hadoop2.4/bin/spark-submit \
   --class mgm.tp.bigdata.ma_spark.SparkMain \
   --master spark://quickstart.cloudera:7077 \
   ma-spark.jar \
   1000

 and get the following exception:

 java.io.IOException: Mkdirs failed to create file:/
 127.0.0.1:8020/user/cloudera/outputs/output_spark
 at
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
 at
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
 at mgm.tp.bigdata.ma_spark.Helper.writeCenterHistory(Helper.java:35)
 at mgm.tp.bigdata.ma_spark.SparkMain.main(SparkMain.java:100)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative
 path in absolute URI: 127.0.0.1:8020
 at org.apache.hadoop.fs.Path.initialize(Path.java:206)
 at org.apache.hadoop.fs.Path.init(Path.java:172)
 at org.apache.hadoop.fs.Path.init(Path.java:94)
 at org.apache.hadoop.fs.Globber.glob(Globber.java:211)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at
 org.apache.spark.rdd.FilteredRDD.getPartitions(FilteredRDD.scala:29)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
 at org.apache.spark.rdd.RDD.count(RDD.scala:904)
 at org.apache.spark.rdd.RDD.takeSample(RDD.scala:401)
 at
 org.apache.spark.api.java.JavaRDDLike$class.takeSample(JavaRDDLike.scala:426)
 at org.apache.spark.api.java.JavaRDD.takeSample(JavaRDD.scala:32)
 at
 org.apache.spark.api.java.JavaRDDLike$class.takeSample(JavaRDDLike.scala:422)
 at org.apache.spark.api.java.JavaRDD.takeSample(JavaRDD.scala:32)
 at mgm.tp.bigdata.ma_spark.SparkMain.kmeans(SparkMain.java:123)
 at mgm.tp.bigdata.ma_spark.SparkMain.main(SparkMain.java:102)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.net.URISyntaxException: Relative path in absolute URI:
 127.0.0.1:8020
 at java.net.URI.checkPath(URI.java:1804)
   

not getting any mail

2015-05-02 Thread Jeetendra Gangele
Hi All

I am not getting any mail from this community?


Re: Enabling Event Log

2015-05-02 Thread Jeetendra Gangele
is it working now?
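
For reference, a spark-defaults.conf sketch reflecting the fix discussed below (the
directory is reused from the thread and must already exist):

spark.eventLog.enabled           true
spark.eventLog.dir               file:/opt/cb/tmp/spark-events
spark.history.fs.logDirectory    file:/opt/cb/tmp/spark-events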

On 1 May 2015 at 13:43, James King jakwebin...@gmail.com wrote:

 Oops! well spotted. Many thanks Shixiong.

 On Fri, May 1, 2015 at 1:25 AM, Shixiong Zhu zsxw...@gmail.com wrote:

 spark.history.fs.logDirectory is for the history server. For Spark
 applications, they should use spark.eventLog.dir. Since you commented out
 spark.eventLog.dir, it will be /tmp/spark-events. And this folder does
 not exits.

 Best Regards,
 Shixiong Zhu

 2015-04-29 23:22 GMT-07:00 James King jakwebin...@gmail.com:

 I'm unclear why I'm getting this exception.

 It seems to have realized that I want to enable  Event Logging but
 ignoring where I want it to log to i.e. file:/opt/cb/tmp/spark-events which
 does exist.

 spark-default.conf

 # Example:
 spark.master spark://master1:7077,master2:7077
 spark.eventLog.enabled   true
 spark.history.fs.logDirectoryfile:/opt/cb/tmp/spark-events
 # spark.eventLog.dir   hdfs://namenode:8021/directory
 # spark.serializer
 org.apache.spark.serializer.KryoSerializer
 # spark.driver.memory  5g
 # spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value
 -Dnumbers=one two three

 Exception following job submission:

 spark.eventLog.enabled=true
 spark.history.fs.logDirectory=file:/opt/cb/tmp/spark-events

 spark.jars=file:/opt/cb/scripts/spark-streamer/cb-spark-streamer-1.0-SNAPSHOT.jar
 spark.master=spark://master1:7077,master2:7077
 Exception in thread main java.lang.IllegalArgumentException: Log
 directory /tmp/spark-events does not exist.
 at
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:99)
 at org.apache.spark.SparkContext.init(SparkContext.scala:399)
 at
 org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:642)
 at
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:75)
 at
 org.apache.spark.streaming.api.java.JavaStreamingContext.init(JavaStreamingContext.scala:132)


 Many Thanks
 jk






Re: MLib KMeans on large dataset issues

2015-04-29 Thread Jeetendra Gangele
How are you passing the feature vectors to KMeans?
Are they in a 2-D space or a 1-D array?

Did you try using Streaming KMeans?

Will you be able to paste the code here?
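
For context, MLlib's KMeans.train expects an RDD[Vector]; a minimal sketch (the
input path, k and iteration count are placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val features = sc.textFile("hdfs:///path/to/features.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

val model = KMeans.train(features, 1000, 20)   // k = 1000, 20 iterations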

On 29 April 2015 at 17:23, Sam Stoelinga sammiest...@gmail.com wrote:

 Hi Sparkers,

 I am trying to run MLib kmeans on a large dataset(50+Gb of data) and a
 large K but I've encountered the following issues:


- Spark driver gets out of memory and dies because collect gets called
as part of KMeans, which loads all data back to the driver's memory.
- At the end there is a LocalKMeans class which runs KMeansPlusPlus on
the Spark driver. Why isn't this distributed? It's spending a long time on
here and this has the same problem as point 1 requires loading the data to
the driver.
Also when LocakKMeans is running on driver also seeing lots of :
15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus
initialization ran out of distinct points for centers. Using duplicate
point for center k = 222
- Has the above behaviour been like this in previous releases? I
remember running KMeans before without too much problems.

 Looking forward to hear you point out my stupidity or provide work-arounds
 that could make Spark KMeans work well on large datasets.

 Regards,
 Sam Stoelinga



solr in spark

2015-04-28 Thread Jeetendra Gangele
Has anyone tried using Solr inside Spark?
Below is the project describing it:
https://github.com/LucidWorks/spark-solr.

I have a requirement in which I want to index 20 million company names
and then search them as and when new data comes in. The output should be the
list of companies matching the query.

Spark has an Elasticsearch connector, but for this purpose is Elasticsearch not
a good option, since this is purely a text-search problem?

Elasticsearch is good for filtering and grouping.

Has anybody used Solr inside Spark?

Regards
jeetendra


Re: solr in spark

2015-04-28 Thread Jeetendra Gangele
Thanks for the reply.

Will the Elasticsearch index live within my cluster, or do I need to host
Elasticsearch separately?
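
For what it's worth, Elasticsearch runs as its own service; it can be co-located on
the Spark nodes or hosted separately, and Spark writes to it over the network via
the elasticsearch-hadoop connector. A minimal indexing sketch with a hypothetical
node address, input path and index name:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // elasticsearch-hadoop connector

val conf = new SparkConf().set("es.nodes", "es-host:9200")
val sc = new SparkContext(conf)

val companies = sc.textFile("hdfs:///companies/names.txt")
  .map(name => Map("name" -> name))
companies.saveToEs("companies/name")   // index/type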


On 28 April 2015 at 22:03, Nick Pentreath nick.pentre...@gmail.com wrote:

 I haven't used Solr for a long time, and haven't used Solr in Spark.

 However, why do you say Elasticsearch is not a good option ...? ES
 absolutely supports full-text search and not just filtering and grouping
 (in fact it's original purpose was and still is text search, though
 filtering, grouping and aggregation are heavily used).
 http://www.elastic.co/guide/en/elasticsearch/guide/master/full-text-search.html



 On Tue, Apr 28, 2015 at 6:27 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Does anyone tried using solr inside spark?
 below is the project describing it.
 https://github.com/LucidWorks/spark-solr.

 I have a requirement in which I want to index 20 millions companies name
 and then search as and when new data comes in. the output should be list of
 companies matching the query.

 Spark has inbuilt elastic search but for this purpose Elastic search is
 not a good option since this is totally text search problem?

 Elastic search is good  for filtering and grouping.

 Does any body used solr inside spark?

 Regards
 jeetendra





Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
loc = D:\\Project\\Spark\\code\\news\\jsonfeeds\\

On 25 April 2015 at 20:49, Jeetendra Gangele gangele...@gmail.com wrote:

 Hi Ayan can you try below line

 loc = D:\\Project\\Spark\\code\\news\\jsonfeeds

 On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote:

 Hi

 I am facing this weird issue.

 I am on Windows, and I am trying to load all files within a folder. Here
 is my code -

 loc = D:\\Project\\Spark\\code\\news\\jsonfeeds
 newsY = sc.textFile(loc)
 print newsY.count()

 Even this simple code fails. I have tried with giving exact file names,
 everything works.

 Am I missing something stupid here? Anyone facing this (anyone still use
 windows?:))

 Here is error trace:

 D:\Project\Spark\code\news\jsonfeeds

 Traceback (most recent call last):
   File D:/Project/Spark/code/newsfeeder.py, line 28, in module
 print newsY.count()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 932, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 923, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 739, in reduce
 vals = self.mapPartitions(func).collect()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 713, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in
 __call__
 self.target_id, self.name)
   File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in
 get_return_value
 format(target_id, '.', name), value)
 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : java.lang.NullPointerException

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

 at org.apache.hadoop.util.Shell.run(Shell.java:455)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

 at
 org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

 at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

 at
 org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

 at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

 at java.lang.reflect.Method.invoke(Unknown Source)

 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

 at py4j.Gateway.invoke(Gateway.java:259)

 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133

Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
Hi Ayan, can you try the line below:

loc = D:\\Project\\Spark\\code\\news\\jsonfeeds

On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote:

 Hi

 I am facing this weird issue.

 I am on Windows, and I am trying to load all files within a folder. Here
 is my code -

 loc = D:\\Project\\Spark\\code\\news\\jsonfeeds
 newsY = sc.textFile(loc)
 print newsY.count()

 Even this simple code fails. I have tried with giving exact file names,
 everything works.

 Am I missing something stupid here? Anyone facing this (anyone still use
 windows?:))

 Here is error trace:

 D:\Project\Spark\code\news\jsonfeeds

 Traceback (most recent call last):
   File D:/Project/Spark/code/newsfeeder.py, line 28, in module
 print newsY.count()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 932, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 923, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 739, in reduce
 vals = self.mapPartitions(func).collect()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 713, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in
 __call__
 self.target_id, self.name)
   File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in
 get_return_value
 format(target_id, '.', name), value)
 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : java.lang.NullPointerException

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

 at org.apache.hadoop.util.Shell.run(Shell.java:455)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

 at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

 at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

 at
 org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

 at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

 at java.lang.reflect.Method.invoke(Unknown Source)

 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

 at py4j.Gateway.invoke(Gateway.java:259)

 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

 at py4j.commands.CallCommand.execute(CallCommand.java:79)

 at py4j.GatewayConnection.run(GatewayConnection.java:207)

 at 

Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
I added an extra trailing slash at the end; sometimes I have seen this kind of issue resolved that way.

On 25 April 2015 at 20:50, Jeetendra Gangele gangele...@gmail.com wrote:

 loc = D:\\Project\\Spark\\code\\news\\jsonfeeds\\

 On 25 April 2015 at 20:49, Jeetendra Gangele gangele...@gmail.com wrote:

 Hi Ayan can you try below line

 loc = D:\\Project\\Spark\\code\\news\\jsonfeeds

 On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote:

 Hi

 I am facing this weird issue.

 I am on Windows, and I am trying to load all files within a folder. Here
 is my code -

 loc = D:\\Project\\Spark\\code\\news\\jsonfeeds
 newsY = sc.textFile(loc)
 print newsY.count()

 Even this simple code fails. I have tried with giving exact file names,
 everything works.

 Am I missing something stupid here? Anyone facing this (anyone still use
 windows?:))

 Here is error trace:

 D:\Project\Spark\code\news\jsonfeeds

 Traceback (most recent call last):
   File D:/Project/Spark/code/newsfeeder.py, line 28, in module
 print newsY.count()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 932, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 923, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 739, in reduce
 vals = self.mapPartitions(func).collect()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 713, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537,
 in __call__
 self.target_id, self.name)
   File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in
 get_return_value
 format(target_id, '.', name), value)
 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : java.lang.NullPointerException

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

 at org.apache.hadoop.util.Shell.run(Shell.java:455)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

 at
 org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

 at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

 at
 org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

 at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

 at java.lang.reflect.Method.invoke(Unknown Source)

 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

 at py4j.reflection.ReflectionEngine.invoke

Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
Also, if this code is in Scala, why is there no val for newsY? Is it defined above?
loc = D:\\Project\\Spark\\code\\news\\jsonfeeds
newsY = sc.textFile(loc)
print newsY.count()

On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote:

 Hi

 I am facing this weird issue.

 I am on Windows, and I am trying to load all files within a folder. Here
 is my code -

 loc = D:\\Project\\Spark\\code\\news\\jsonfeeds
 newsY = sc.textFile(loc)
 print newsY.count()

 Even this simple code fails. I have tried with giving exact file names,
 everything works.

 Am I missing something stupid here? Anyone facing this (anyone still use
 windows?:))

 Here is error trace:

 D:\Project\Spark\code\news\jsonfeeds

 Traceback (most recent call last):
   File D:/Project/Spark/code/newsfeeder.py, line 28, in module
 print newsY.count()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 932, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 923, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 739, in reduce
 vals = self.mapPartitions(func).collect()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 713, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in
 __call__
 self.target_id, self.name)
   File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in
 get_return_value
 format(target_id, '.', name), value)
 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : java.lang.NullPointerException

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

 at org.apache.hadoop.util.Shell.run(Shell.java:455)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

 at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

 at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

 at
 org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

 at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

 at java.lang.reflect.Method.invoke(Unknown Source)

 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

 at py4j.Gateway.invoke(Gateway.java:259)

 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

 at 

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
zipWithIndex will preserve whatever order is already in your val lines.
I am not sure whether val lines = sc.textFile("hdfs://mytextFile") itself
maintains the order, but the next step will maintain it for sure.
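
A minimal sketch of what is being discussed (the path is reused from the question):

val lines = sc.textFile("hdfs://mytextFile")
// zipWithIndex assigns indices by partition order and by order within each
// partition, i.e. it follows the order of the underlying Hadoop splits
val indexed = lines.zipWithIndex().map { case (line, idx) => (idx, line) }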



On 24 April 2015 at 18:35, Spico Florin spicoflo...@gmail.com wrote:

 Hello!
   I know that HadoopRDD partitions are built based on the number of splits
 in HDFS. I'm wondering if these partitions preserve the initial order of
 data in file.
 As an example, if I have an HDFS (myTextFile) file that has these splits:

 split 0- line 1, ..., line k
 split 1-line k+1,..., line k+n
 splt 2-line k+n, line k+n+m

 and the code
 val lines=sc.textFile(hdfs://mytextFile)
 lines.zipWithIndex()

 will the order of lines preserved?
 (line 1, zipIndex 1) , .. (line k, zipIndex k), and so one.

 I found this question on stackoverflow (
 http://stackoverflow.com/questions/26046410/how-can-i-obtain-an-element-position-in-sparks-rdd)
 whose answer intrigued me:
 Essentially, RDD's zipWithIndex() method seems to do this, but it won't
 preserve the original ordering of the data the RDD was created from

 Can you please confirm that is this the correct answer?

 Thanks.
  Florin






Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
Thanks that's why I was worried and tested my application again :).

On 24 April 2015 at 23:22, Michal Michalski michal.michal...@boxever.com
wrote:

 Yes.

 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 17:12, Jeetendra Gangele gangele...@gmail.com wrote:

 you used ZipWithUniqueID?

 On 24 April 2015 at 21:28, Michal Michalski michal.michal...@boxever.com
  wrote:

 I somehow missed zipWithIndex (and Sean's email), thanks for hint. I
 mean - I saw it before, but I just thought it's not doing what I want. I've
 re-read the description now and it looks like it might be actually what I
 need. Thanks.

 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 16:26, Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

  To maintain the order you can use zipWithIndex as Sean Owen pointed
 out. This is the same as zipWithUniqueId except the assigned number is the
 index of the data in the RDD which I believe matches the order of data as
 it's stored on HDFS.



 Sent with Good (www.good.com)


 -Original Message-
 *From: *Michal Michalski [michal.michal...@boxever.com]
 *Sent: *Friday, April 24, 2015 11:18 AM Eastern Standard Time
 *To: *Ganelin, Ilya
 *Cc: *Spico Florin; user
 *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order
 of the input data from Hadoop?

 I read it one by one as I need to maintain the order, but it doesn't
 mean that I process them one by one later. Input lines refer to different
 entities I update, so once I read them in order, I group them by the id of
 the entity I want to update, sort the updates on per-entity basis and
 process them further in parallel (including writing data to C* and Kafka at
 the very end). That's what I use Spark for - the first step I ask about is
 just a requirement related to the input format I get and need to support.
 Everything what happens after that is just a normal data processing job
 that you want to distribute.

  Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 16:10, Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

 If you're reading a file one by line then you should simply use Java's
 Hadoop FileSystem class to read the file with a BuffereInputStream. I 
 don't
 think you need an RDD here.



 Sent with Good (www.good.com)


 -Original Message-
 *From: *Michal Michalski [michal.michal...@boxever.com]
  *Sent: *Friday, April 24, 2015 11:04 AM Eastern Standard Time
 *To: *Ganelin, Ilya
 *Cc: *Spico Florin; user
 *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order
 of the input data from Hadoop?

 The problem I'm facing is that I need to process lines from input file
 in the order they're stored in the file, as they define the order of
 updates I need to apply on some data and these updates are not commutative
 so that order matters. Unfortunately the input is purely order-based,
 theres no timestamp per line etc. in the file and I'd prefer to avoid
 preparing the file in advance by adding ordinals before / after each line.
 From the approaches you suggested first two won't work as there's nothing 
 I
 could sort by. I'm not sure about the third one - I'm just not sure what
 you meant there to be honest :-)

  Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 15:48, Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

 Michael - you need to sort your RDD. Check out the shuffle
 documentation on the Spark Programming Guide. It talks about this
 specifically. You can resolve this in a couple of ways - either by
 collecting your RDD and sorting it, using sortBy, or not worrying about 
 the
 internal ordering. You can still extract elements in order by using a
 filter with the zip if e.g RDD.filter(s = s._2  50).sortBy(_._1)



 Sent with Good (www.good.com)



 -Original Message-
 *From: *Michal Michalski [michal.michal...@boxever.com]
 *Sent: *Friday, April 24, 2015 10:41 AM Eastern Standard Time
 *To: *Spico Florin
 *Cc: *user
 *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order
 of the input data from Hadoop?

 Of course after you do it, you probably want to call
 repartition(somevalue) on your RDD to get your paralellism back.

  Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 15:28, Michal Michalski 
 michal.michal...@boxever.com wrote:

 I did a quick test as I was curious about it too. I created a file
 with numbers from 0 to 999, in order, line by line. Then I did:

 scala val numbers = sc.textFile(./numbers.txt)
 scala val zipped = numbers.zipWithUniqueId
 scala zipped.foreach(i = println(i))

 Expected result if the order was preserved would be something like:
 (0, 0), (1, 1) etc.
 Unfortunately, the output looks like this:

  (126,1)
 (223,2)
 (320,3)
 (1,0)
 (127,11)
 (2,10)
  (...)

 The workaround I found that works for me for my specific use case
 (relatively small input files) is setting

Re: regarding ZipWithIndex

2015-04-24 Thread Jeetendra Gangele
Can anyone guide me on how to reduce the index size from Long to Int, since I
don't need a Long index?
I have huge data and this index takes 8 bytes; if I can reduce it to 4
bytes it will be a great help.
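
There is no zipWithIndex variant that produces an Int, but since the index is only
used as a key, a minimal sketch is to narrow it right after zipping (safe only while
the RDD has fewer than Int.MaxValue elements):

val withIntIndex = rdd.zipWithIndex().map { case (value, idx) =>
  (idx.toInt, value)   // assumes the element count fits in an Int
}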

On 22 April 2015 at 22:46, Jeetendra Gangele gangele...@gmail.com wrote:

 Sure thanks. if you can guide me how to do this will be great help.

 On 17 April 2015 at 22:05, Ted Yu yuzhih...@gmail.com wrote:

 I have some assignments on hand at the moment.

 Will try to come up with sample code after I clear the assignments.

 FYI

 On Thu, Apr 16, 2015 at 2:00 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Can you please guide me how can I extend RDD and convert into this way
 you are suggesting.

 On 16 April 2015 at 23:46, Jeetendra Gangele gangele...@gmail.com
 wrote:

 In type T I already have Object... I have a JavaRDD<Object> and I am
 calling zipWithIndex on this RDD, getting an RDD<Object, Long>; on this I am
 running mapToPair and converting it into an RDD<Long, Object> so that I can use it
 later for other operations like lookup and join.


 On 16 April 2015 at 23:42, Ted Yu yuzhih...@gmail.com wrote:

 The Long in RDD[(T, Long)] is type parameter. You can create RDD with
 Integer as the first type parameter.

 Cheers

 On Thu, Apr 16, 2015 at 11:07 AM, Jeetendra Gangele 
 gangele...@gmail.com wrote:

 Hi Ted.
 This works for me. But since Long takes 8 bytes here, can I reduce it
 to 4 bytes? It's just an index, and I feel 4 bytes would be more than
 enough. Is there any method which takes Integer or similar for the index?


 On 13 April 2015 at 01:59, Ted Yu yuzhih...@gmail.com wrote:

 bq. will return something like JavaPairRDD<Object, Long>

 The long component of the pair fits your description of index. What
 other requirement does ZipWithIndex not provide you ?

 Cheers

 On Sun, Apr 12, 2015 at 1:16 PM, Jeetendra Gangele 
 gangele...@gmail.com wrote:

 Hi All, I have a JavaRDD<Object> and I want to convert it to a
 JavaPairRDD<Index, Object>. The index should be unique and it should
 maintain the order: for the first object it should be 1, for the second 2,
 and so on.

 I tried using zipWithIndex but it returns something like
 JavaPairRDD<Object, Long>.
 I want to use this RDD for lookup and join operations later in my
 workflow, so ordering is important.


 Regards
 jeet





















Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
Did you use zipWithUniqueId?

On 24 April 2015 at 21:28, Michal Michalski michal.michal...@boxever.com
wrote:

 I somehow missed zipWithIndex (and Sean's email), thanks for hint. I mean
 - I saw it before, but I just thought it's not doing what I want. I've
 re-read the description now and it looks like it might be actually what I
 need. Thanks.

 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 16:26, Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

  To maintain the order you can use zipWithIndex as Sean Owen pointed out.
 This is the same as zipWithUniqueId except the assigned number is the index
 of the data in the RDD which I believe matches the order of data as it's
 stored on HDFS.



 Sent with Good (www.good.com)


 -Original Message-
 *From: *Michal Michalski [michal.michal...@boxever.com]
 *Sent: *Friday, April 24, 2015 11:18 AM Eastern Standard Time
 *To: *Ganelin, Ilya
 *Cc: *Spico Florin; user
 *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of
 the input data from Hadoop?

 I read it one by one as I need to maintain the order, but it doesn't mean
 that I process them one by one later. Input lines refer to different
 entities I update, so once I read them in order, I group them by the id of
 the entity I want to update, sort the updates on per-entity basis and
 process them further in parallel (including writing data to C* and Kafka at
 the very end). That's what I use Spark for - the first step I ask about is
 just a requirement related to the input format I get and need to support.
 Everything what happens after that is just a normal data processing job
 that you want to distribute.

  Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 16:10, Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

 If you're reading a file one by line then you should simply use Java's
 Hadoop FileSystem class to read the file with a BuffereInputStream. I don't
 think you need an RDD here.



 Sent with Good (www.good.com)


 -Original Message-
 *From: *Michal Michalski [michal.michal...@boxever.com]
  *Sent: *Friday, April 24, 2015 11:04 AM Eastern Standard Time
 *To: *Ganelin, Ilya
 *Cc: *Spico Florin; user
 *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of
 the input data from Hadoop?

 The problem I'm facing is that I need to process lines from input file
 in the order they're stored in the file, as they define the order of
 updates I need to apply on some data and these updates are not commutative
 so that order matters. Unfortunately the input is purely order-based,
 theres no timestamp per line etc. in the file and I'd prefer to avoid
 preparing the file in advance by adding ordinals before / after each line.
 From the approaches you suggested first two won't work as there's nothing I
 could sort by. I'm not sure about the third one - I'm just not sure what
 you meant there to be honest :-)

  Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 15:48, Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

 Michael - you need to sort your RDD. Check out the shuffle
 documentation on the Spark Programming Guide. It talks about this
 specifically. You can resolve this in a couple of ways - either by
 collecting your RDD and sorting it, using sortBy, or not worrying about the
 internal ordering. You can still extract elements in order by using a
 filter with the zip if e.g RDD.filter(s = s._2  50).sortBy(_._1)



 Sent with Good (www.good.com)



 -Original Message-
 *From: *Michal Michalski [michal.michal...@boxever.com]
 *Sent: *Friday, April 24, 2015 10:41 AM Eastern Standard Time
 *To: *Spico Florin
 *Cc: *user
 *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order
 of the input data from Hadoop?

 Of course after you do it, you probably want to call
 repartition(somevalue) on your RDD to get your paralellism back.

  Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 15:28, Michal Michalski 
 michal.michal...@boxever.com wrote:

 I did a quick test as I was curious about it too. I created a file
 with numbers from 0 to 999, in order, line by line. Then I did:

 scala val numbers = sc.textFile(./numbers.txt)
 scala val zipped = numbers.zipWithUniqueId
 scala zipped.foreach(i = println(i))

 Expected result if the order was preserved would be something like:
 (0, 0), (1, 1) etc.
 Unfortunately, the output looks like this:

  (126,1)
 (223,2)
 (320,3)
 (1,0)
 (127,11)
 (2,10)
  (...)

 The workaround I found that works for me for my specific use case
 (relatively small input files) is setting explicitly the number of
 partitions to 1 when reading a single *text* file:

 scala val numbers_sp = sc.textFile(./numbers.txt, 1)

 Than the output is exactly as I would expect.

 I didn't dive into the code too much, but I took a very quick look at
 it and figured out - correct me if I missed something, 

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
I have an RDD<Object> which I get from an HBase scan using newAPIHadoopRDD. I
am running zipWithIndex on it and it is preserving the order: the first object got
1, the second got 2, the third got 3, and so on; the nth object got n.




On 24 April 2015 at 20:56, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:

  To maintain the order you can use zipWithIndex as Sean Owen pointed out.
 This is the same as zipWithUniqueId except the assigned number is the index
 of the data in the RDD which I believe matches the order of data as it's
 stored on HDFS.



 Sent with Good (www.good.com)


 -Original Message-
 *From: *Michal Michalski [michal.michal...@boxever.com]
 *Sent: *Friday, April 24, 2015 11:18 AM Eastern Standard Time
 *To: *Ganelin, Ilya
 *Cc: *Spico Florin; user
 *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of
 the input data from Hadoop?

 I read it one by one as I need to maintain the order, but it doesn't mean
 that I process them one by one later. Input lines refer to different
 entities I update, so once I read them in order, I group them by the id of
 the entity I want to update, sort the updates on per-entity basis and
 process them further in parallel (including writing data to C* and Kafka at
 the very end). That's what I use Spark for - the first step I ask about is
 just a requirement related to the input format I get and need to support.
 Everything what happens after that is just a normal data processing job
 that you want to distribute.

  Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 16:10, Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

If you're reading a file line by line then you should simply use Java's
Hadoop FileSystem class to read the file with a BufferedInputStream. I don't
think you need an RDD here.



 Sent with Good (www.good.com)


 -Original Message-
 *From: *Michal Michalski [michal.michal...@boxever.com]
  *Sent: *Friday, April 24, 2015 11:04 AM Eastern Standard Time
 *To: *Ganelin, Ilya
 *Cc: *Spico Florin; user
 *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of
 the input data from Hadoop?

 The problem I'm facing is that I need to process lines from input file in
 the order they're stored in the file, as they define the order of updates I
need to apply on some data, and these updates are not commutative, so the
order matters. Unfortunately the input is purely order-based; there's no
timestamp per line etc. in the file, and I'd prefer to avoid preparing the
file in advance by adding ordinals before / after each line. From the
 approaches you suggested first two won't work as there's nothing I could
 sort by. I'm not sure about the third one - I'm just not sure what you
 meant there to be honest :-)

  Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 15:48, Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

Michael - you need to sort your RDD. Check out the shuffle documentation
in the Spark Programming Guide. It talks about this specifically. You can
resolve this in a couple of ways: either by collecting your RDD and
sorting it, using sortBy, or not worrying about the internal ordering. You
can still extract elements in order by using a filter with the zip, e.g.
RDD.filter(s => s._2 < 50).sortBy(_._1)



 Sent with Good (www.good.com)



 -Original Message-
 *From: *Michal Michalski [michal.michal...@boxever.com]
 *Sent: *Friday, April 24, 2015 10:41 AM Eastern Standard Time
 *To: *Spico Florin
 *Cc: *user
 *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of
 the input data from Hadoop?

Of course after you do it, you probably want to call
repartition(somevalue) on your RDD to get your parallelism back.

  Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 24 April 2015 at 15:28, Michal Michalski 
 michal.michal...@boxever.com wrote:

 I did a quick test as I was curious about it too. I created a file with
 numbers from 0 to 999, in order, line by line. Then I did:

scala> val numbers = sc.textFile("./numbers.txt")
scala> val zipped = numbers.zipWithUniqueId
scala> zipped.foreach(i => println(i))

 Expected result if the order was preserved would be something like: (0,
 0), (1, 1) etc.
 Unfortunately, the output looks like this:

  (126,1)
 (223,2)
 (320,3)
 (1,0)
 (127,11)
 (2,10)
  (...)

 The workaround I found that works for me for my specific use case
 (relatively small input files) is setting explicitly the number of
 partitions to 1 when reading a single *text* file:

scala> val numbers_sp = sc.textFile("./numbers.txt", 1)

Then the output is exactly as I would expect.

 I didn't dive into the code too much, but I took a very quick look at
 it and figured out - correct me if I missed something, it's Friday
 afternoon! ;-)  - that this workaround will work fine for all the input
 formats inheriting from org.apache.hadoop.mapred.FileInputFormat including
 TextInputFormat, of course - see the 

Re: Tasks run only on one machine

2015-04-23 Thread Jeetendra Gangele
Will you be able to paste code here?
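
For what it's worth, a hedged sketch (the path and the partition count of 64 are
made up, and jsc stands for an assumed JavaSparkContext) of the two things usually
checked in this situation: the minPartitions hint on textFile and an explicit
repartition so derived RDDs do not inherit too few partitions.

JavaRDD<String> nano = jsc.textFile("hdfs:///stream-out/part-*", 64);  // minPartitions hint
System.out.println("partitions after read: " + nano.partitions().size());
JavaRDD<String> spread = nano.repartition(64);  // forces a shuffle that spreads work across executors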

On 23 April 2015 at 22:21, Pat Ferrel p...@occamsmachete.com wrote:

 Using Spark streaming to create a large volume of small nano-batch input
 files, ~4k per file, thousands of 'part-x' files.  When reading the
 nano-batch files and doing a distributed calculation my tasks run only on
 the machine where it was launched. I'm launching in yarn-client mode. The
 rdd is created using sc.textFile(list of thousand files)

 What would cause the read to occur only on the machine that launched the
 driver?

 Do I need to do something to the RDD after reading? Has some partition
 factor been applied to all derived rdds?
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Distinct is very slow

2015-04-23 Thread Jeetendra Gangele
Does anyone have any thoughts on this?

On 22 April 2015 at 22:49, Jeetendra Gangele gangele...@gmail.com wrote:

 I made 7000 tasks in mapToPair and the same number of tasks in distinct.
 Still, a lot of shuffle read and write is happening, so the application
 runs for much longer.
 Any idea?

 On 17 April 2015 at 11:55, Akhil Das ak...@sigmoidanalytics.com wrote:

 How many tasks are you seeing in your mapToPair stage? Is it 7000? Then I
 suggest giving a number similar/close to 7000 in your .distinct call. What is
 happening in your case is that you are repartitioning your data to a smaller
 number (32), which I believe puts a lot of load on the processing; you can try
 increasing it.

 Thanks
 Best Regards

 On Fri, Apr 17, 2015 at 1:48 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Akhil, any thought on this?

 On 16 April 2015 at 23:07, Jeetendra Gangele gangele...@gmail.com
 wrote:

 No, I did not try the partitioning. Below is the full code:

 public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd, JavaSparkContext jsc) throws IOException {
  long start = System.currentTimeMillis();
  JavaPairRDD<Long, MatcherReleventData> RddForMarch = matchRdd.zipWithIndex().mapToPair(
    new PairFunction<Tuple2<VendorRecord, Long>, Long, MatcherReleventData>() {

   @Override
   public Tuple2<Long, MatcherReleventData> call(Tuple2<VendorRecord, Long> t)
     throws Exception {
    MatcherReleventData matcherData = new MatcherReleventData();
    Tuple2<Long, MatcherReleventData> tuple = new Tuple2<Long, MatcherReleventData>(t._2,
      matcherData.convertVendorDataToMatcherData(t._1));
    return tuple;
   }

  }).cache();
  log.info("after index " + RddForMarch.take(1));
  Map<Long, MatcherReleventData> tmp = RddForMarch.collectAsMap();
  Map<Long, MatcherReleventData> matchData = new HashMap<Long, MatcherReleventData>(tmp);
  final Broadcast<Map<Long, MatcherReleventData>> dataMatchGlobal = jsc.broadcast(matchData);

  JavaPairRDD<Long, String> blockingRdd = RddForMarch.flatMapValues(new Function<MatcherReleventData, Iterable<String>>() {

   @Override
   public Iterable<String> call(MatcherReleventData v1)
     throws Exception {
    List<String> values = new ArrayList<String>();
    HelperUtilities helper1 = new HelperUtilities();
    MatcherKeys matchkeys = helper1.getBlockinkeys(v1);
    if (matchkeys.get_companyName() != null) {
     values.add(matchkeys.get_companyName());
    }
    if (matchkeys.get_phoneNumberr() != null) {
     values.add(matchkeys.get_phoneNumberr());
    }
    if (matchkeys.get_zipCode() != null) {
     values.add(matchkeys.get_zipCode());
    }
    if (matchkeys.getM_domain() != null) {
     values.add(matchkeys.getM_domain());
    }
    return values;
   }
  });
  log.info("blocking RDD is " + blockingRdd.count());
  int count = 0;
  log.info("Starting printing");
  for (Tuple2<Long, String> entry : blockingRdd.collect()) {
   log.info(entry._1() + " : " + entry._2());
   count++;
  }
  log.info("total count " + count);
  JavaPairRDD<Long, Integer> completeDataToprocess = blockingRdd.flatMapValues(new Function<String, Iterable<Integer>>() {

   @Override
   public Iterable<Integer> call(String v1) throws Exception {
    return ckdao.getSingelkeyresult(v1);
   }
  }).distinct(32);
  log.info("after hbase count is " + completeDataToprocess.count());
  log.info("data for process " + completeDataToprocess.take(1));
  JavaPairRDD<Long, Tuple2<Integer, Double>> withScore = completeDataToprocess.mapToPair(
    new PairFunction<Tuple2<Long, Integer>, Long, Tuple2<Integer, Double>>() {

   @Override
   public Tuple2<Long, Tuple2<Integer, Double>> call(Tuple2<Long, Integer> t)
     throws Exception {
    Scoring scoreObj = new Scoring();
    double score = scoreObj.computeMatchScore(companyDAO.get(t._2()),
      dataMatchGlobal.getValue().get(t._1()));
    Tuple2<Integer, Double> maptuple = new Tuple2<Integer, Double>(t._2(), score);
    Tuple2<Long, Tuple2<Integer, Double>> tuple = new Tuple2<Long, Tuple2<Integer, Double>>(t._1(), maptuple);
    return tuple;
   }
  });
  log.info("with score tuple is " + withScore.take(1));
  JavaPairRDD<Long, Tuple2<Integer, Double>> maxScoreRDD = withScore.reduceByKey(
    new Function2<Tuple2<Integer, Double>, Tuple2<Integer, Double>, Tuple2<Integer, Double>>() {

   @Override
   public Tuple2<Integer, Double> call(Tuple2<Integer, Double> v1,
     Tuple2<Integer, Double> v2) throws Exception {
    int res = v1._2().compareTo(v2._2());
    if (res > 0) {
     Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v1._1(), v1._2());
     return result;
    } else if (res < 0) {
     Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v2._1(), v2._2());
     return result;
    } else {
     Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v2._1(), v2._2());
     return result;
    }
   }
  });
  log.info("max score RDD " + maxScoreRDD.take(10));

  maxScoreRDD.foreach(new VoidFunction<Tuple2<Long, Tuple2<Integer, Double>>>() {

   @Override
   public void call(Tuple2<Long, Tuple2<Integer, Double>> t)
     throws Exception {
    MatcherReleventData matchedData = dataMatchGlobal.getValue().get(t._1());
    log.info("broadcast is " + dataMatchGlobal.getValue().get(t._1()));
    // Set the score for better understanding of merge
    matchedData.setScore(t._2()._2());

Streaming Kmeans usage in java

2015-04-23 Thread Jeetendra Gangele
Hi everyone, do we have a sample example of how to use streaming k-means
clustering with Java? I have seen some example usage in Scala. Can anybody
point me to a Java example?

regards
jeetendra
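
For reference, a minimal and untested sketch of driving MLlib's StreamingKMeans
from Java (roughly the Spark 1.3-era API); the stream source, k, dimension and
the line-parsing format are all assumptions for illustration, not code from this
thread. The imports go at the top of the file and the statements inside a main method.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.StreamingKMeans;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setAppName("streaming-kmeans-sketch");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

// parse each incoming line "x,y,z" into a dense vector (assumed input format)
JavaDStream<Vector> training = jssc.textFileStream("hdfs:///train").map(
    new Function<String, Vector>() {
      public Vector call(String line) {
        String[] parts = line.split(",");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) values[i] = Double.parseDouble(parts[i]);
        return Vectors.dense(values);
      }
    });

StreamingKMeans model = new StreamingKMeans()
    .setK(5)                        // number of clusters (assumed)
    .setDecayFactor(1.0)
    .setRandomCenters(3, 0.0, 42L); // dimension, weight, seed (assumed)

model.trainOn(training.dstream()); // hand the underlying DStream<Vector> to the model
jssc.start();
jssc.awaitTermination();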


Re: Clustering algorithms in Spark

2015-04-22 Thread Jeetendra Gangele
Does anybody have any thoughts on this?

On 21 April 2015 at 20:57, Jeetendra Gangele gangele...@gmail.com wrote:

 The problem with k-means is that we have to define the number of clusters,
 which I don't want in this case.
 So I am thinking of something like hierarchical clustering. Any ideas and
 suggestions?



 On 21 April 2015 at 20:51, Jeetendra Gangele gangele...@gmail.com wrote:

 I have a requirement in which I want to match company names, and I am
 thinking of solving this using a clustering technique.

 Can anybody suggest which algorithm I should use in Spark, and how to evaluate
 the running time and accuracy for this particular problem?

 I checked k-means and it looks good.
 Any ideas or suggestions?









Re: regarding ZipWithIndex

2015-04-22 Thread Jeetendra Gangele
Sure, thanks. If you can guide me on how to do this, it would be a great help.
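
While waiting for Ted's sample, here is a hedged sketch of a simpler route (no
custom RDD): keep zipWithIndex but immediately narrow the Long to an int key,
which is only safe while the RDD has fewer than Integer.MAX_VALUE elements. rdd
stands for the JavaRDD<Object> discussed in this thread.

// assumes org.apache.spark.api.java.*, org.apache.spark.api.java.function.PairFunction, scala.Tuple2
JavaPairRDD<Integer, Object> indexed = rdd.zipWithIndex().mapToPair(
    new PairFunction<Tuple2<Object, Long>, Integer, Object>() {
      @Override
      public Tuple2<Integer, Object> call(Tuple2<Object, Long> t) throws Exception {
        return new Tuple2<Integer, Object>(t._2().intValue(), t._1());  // 8-byte index narrowed to 4 bytes
      }
    });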

On 17 April 2015 at 22:05, Ted Yu yuzhih...@gmail.com wrote:

 I have some assignments on hand at the moment.

 Will try to come up with sample code after I clear the assignments.

 FYI

 On Thu, Apr 16, 2015 at 2:00 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Can you please guide me on how I can extend RDD and convert it in the way
 you are suggesting?

 On 16 April 2015 at 23:46, Jeetendra Gangele gangele...@gmail.com
 wrote:

 For type T I already have Object: I have an RDD<Object> and then I am
 calling zipWithIndex on this RDD and getting an RDD<Object, Long>; on this I am
 running mapToPair and converting it into an RDD<Long, Object> so that I can use
 it later for other operations like lookup and join.


 On 16 April 2015 at 23:42, Ted Yu yuzhih...@gmail.com wrote:

 The Long in RDD[(T, Long)] is type parameter. You can create RDD with
 Integer as the first type parameter.

 Cheers

 On Thu, Apr 16, 2015 at 11:07 AM, Jeetendra Gangele 
 gangele...@gmail.com wrote:

 Hi Ted.
 This works for me. But Long takes 8 bytes here. Can I reduce it
 to 4 bytes? It's just an index and I feel 4 bytes is more than
 enough. Is there any method which takes Integer or similar for the index?


 On 13 April 2015 at 01:59, Ted Yu yuzhih...@gmail.com wrote:

 bq. will return something like JavaPairRDD<Object, Long>

 The long component of the pair fits your description of index. What
 other requirement does ZipWithIndex not provide you ?

 Cheers

 On Sun, Apr 12, 2015 at 1:16 PM, Jeetendra Gangele 
 gangele...@gmail.com wrote:

 Hi All, I have a JavaRDD<Object> and I want to convert it to
 JavaPairRDD<Index, Object>. The index should be unique and it should maintain
 the order: the first object should get 1, the second 2, and so on.

 I tried using zipWithIndex but it returns something like
 JavaPairRDD<Object, Long>.
 I wanted to use this RDD for lookup and join operations later in my
 workflow, so ordering is important.


 Regards
 jeet

















Re: ElasticSearch for Spark times out

2015-04-22 Thread Jeetendra Gangele
Will you be able to paste the code?

On 23 April 2015 at 00:19, Adrian Mocanu amoc...@verticalscope.com wrote:

  Hi



 I use the ElasticSearch package for Spark and very often it times out
 reading data from ES into an RDD.

 How can I keep the connection alive (why doesn't it? Bug?)



 Here's the exception I get:

 org.elasticsearch.hadoop.serialization.EsHadoopSerializationException:
 java.net.SocketTimeoutException: Read timed out

 at
 org.elasticsearch.hadoop.serialization.json.JacksonJsonParser.nextToken(JacksonJsonParser.java:86)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.serialization.ParsingUtils.doSeekToken(ParsingUtils.java:70)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.serialization.ParsingUtils.seek(ParsingUtils.java:58)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:149)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:102)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:81)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:314)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:76)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:46)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 ~[scala-library.jar:na]

 at
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 ~[scala-library.jar:na]

 at
 scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
 ~[scala-library.jar:na]

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 ~[scala-library.jar:na]

 at
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 ~[scala-library.jar:na]

 at
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 ~[scala-library.jar:na]

 at
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
 ~[spark-core_2.10-1.1.0.jar:1.1.0]

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 ~[spark-core_2.10-1.1.0.jar:1.1.0]

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 ~[spark-core_2.10-1.1.0.jar:1.1.0]

 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 ~[spark-core_2.10-1.1.0.jar:1.1.0]

 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 ~[spark-core_2.10-1.1.0.jar:1.1.0]

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 [na:1.7.0_75]

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 [na:1.7.0_75]

 at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]

 Caused by: java.net.SocketTimeoutException: Read timed out

 at java.net.SocketInputStream.socketRead0(Native Method)
 ~[na:1.7.0_75]

 at
 java.net.SocketInputStream.read(SocketInputStream.java:152) ~[na:1.7.0_75]

 at
 java.net.SocketInputStream.read(SocketInputStream.java:122) ~[na:1.7.0_75]

 at
 java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
 ~[na:1.7.0_75]

 at
 java.io.BufferedInputStream.read(BufferedInputStream.java:334)
 ~[na:1.7.0_75]

 at
 org.apache.commons.httpclient.WireLogInputStream.read(WireLogInputStream.java:69)
 ~[commons-httpclient-3.1.jar:na]

 at
 org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)
 ~[commons-httpclient-3.1.jar:na]

 at
 java.io.FilterInputStream.read(FilterInputStream.java:133) ~[na:1.7.0_75]

 at
 org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108)
 ~[commons-httpclient-3.1.jar:na]

 at
 org.elasticsearch.hadoop.rest.DelegatingInputStream.read(DelegatingInputStream.java:57)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.codehaus.jackson.impl.Utf8StreamParser.loadMore(Utf8StreamParser.java:172)
 ~[jackson-core-asl-1.9.11.jar:1.9.11]

 at
 org.codehaus.jackson.impl.Utf8StreamParser.parseEscapedFieldName(Utf8StreamParser.java:1502)
 

Re: ElasticSearch for Spark times out

2015-04-22 Thread Jeetendra Gangele
Basically a read timeout means that no data arrived within the specified
receive timeout period.
A few things I would suggest:
1. Is your ES cluster up and running?
2. If 1 is yes, then reduce the size of the index (make it a few KB) and test
again; a hedged connector-settings sketch follows below.
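
A hedged tuning sketch for the connector side (verify the option names against
the elasticsearch-hadoop version in use; the node address and values are
assumptions): raising the HTTP timeout and shrinking the scroll batch are the
usual first knobs for read timeouts.

SparkConf conf = new SparkConf()
    .set("es.nodes", "es-host:9200")  // assumed address
    .set("es.http.timeout", "5m")     // allow slower responses before timing out
    .set("es.http.retries", "5")
    .set("es.scroll.size", "200");    // fewer hits per scroll round-trip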

On 23 April 2015 at 00:19, Adrian Mocanu amoc...@verticalscope.com wrote:

  Hi



 I use the ElasticSearch package for Spark and very often it times out
 reading data from ES into an RDD.

 How can I keep the connection alive (why doesn't it? Bug?)



 Here's the exception I get:

 org.elasticsearch.hadoop.serialization.EsHadoopSerializationException:
 java.net.SocketTimeoutException: Read timed out

 at
 org.elasticsearch.hadoop.serialization.json.JacksonJsonParser.nextToken(JacksonJsonParser.java:86)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.serialization.ParsingUtils.doSeekToken(ParsingUtils.java:70)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.serialization.ParsingUtils.seek(ParsingUtils.java:58)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:149)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:102)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:81)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:314)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:76)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:46)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 ~[scala-library.jar:na]

 at
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 ~[scala-library.jar:na]

 at
 scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
 ~[scala-library.jar:na]

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 ~[scala-library.jar:na]

 at
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 ~[scala-library.jar:na]

 at
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 ~[scala-library.jar:na]

 at
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
 ~[spark-core_2.10-1.1.0.jar:1.1.0]

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 ~[spark-core_2.10-1.1.0.jar:1.1.0]

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 ~[spark-core_2.10-1.1.0.jar:1.1.0]

 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 ~[spark-core_2.10-1.1.0.jar:1.1.0]

 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 ~[spark-core_2.10-1.1.0.jar:1.1.0]

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 [na:1.7.0_75]

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 [na:1.7.0_75]

 at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]

 Caused by: java.net.SocketTimeoutException: Read timed out

 at java.net.SocketInputStream.socketRead0(Native Method)
 ~[na:1.7.0_75]

 at
 java.net.SocketInputStream.read(SocketInputStream.java:152) ~[na:1.7.0_75]

 at
 java.net.SocketInputStream.read(SocketInputStream.java:122) ~[na:1.7.0_75]

 at
 java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
 ~[na:1.7.0_75]

 at
 java.io.BufferedInputStream.read(BufferedInputStream.java:334)
 ~[na:1.7.0_75]

 at
 org.apache.commons.httpclient.WireLogInputStream.read(WireLogInputStream.java:69)
 ~[commons-httpclient-3.1.jar:na]

 at
 org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)
 ~[commons-httpclient-3.1.jar:na]

 at
 java.io.FilterInputStream.read(FilterInputStream.java:133) ~[na:1.7.0_75]

 at
 org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108)
 ~[commons-httpclient-3.1.jar:na]

 at
 org.elasticsearch.hadoop.rest.DelegatingInputStream.read(DelegatingInputStream.java:57)
 ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]

 at
 

Re: Clustering algorithms in Spark

2015-04-21 Thread Jeetendra Gangele
The problem with k-means is that we have to define the number of clusters,
which I don't want in this case.
So I am thinking of something like hierarchical clustering. Any ideas and
suggestions?



On 21 April 2015 at 20:51, Jeetendra Gangele gangele...@gmail.com wrote:

 I have a requirement in which I want to match company names, and I am
 thinking of solving this using a clustering technique.

 Can anybody suggest which algorithm I should use in Spark, and how to evaluate
 the running time and accuracy for this particular problem?

 I checked k-means and it looks good.
 Any ideas or suggestions?




Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Jeetendra Gangele
HI All,

I am querying HBase, combining the results, and using them in my Spark job.
I am querying HBase using the HBase client API inside my Spark job.
Can anybody suggest whether Spark SQL will be fast enough and provide range
queries?

Regards
Jeetendra


Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-20 Thread Jeetendra Gangele
Write a cron job for this, like the one below:

12 * * * *  find $SPARK_HOME/work -cmin +1440 -prune -exec rm -rf {} \+
32 * * * *  find /tmp -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
52 * * * *  find $SPARK_LOCAL_DIR -mindepth 1 -maxdepth 1 -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+

On 20 April 2015 at 23:12, N B nb.nos...@gmail.com wrote:

 Hi all,

 I had posed this query as part of a different thread but did not get a
 response there. So creating a new thread hoping to catch someone's
 attention.

 We are experiencing this issue of shuffle files being left behind and not
 being cleaned up by Spark. Since this is a Spark streaming application, it
 is expected to stay up indefinitely, so shuffle files not being cleaned up
 is a big problem right now. Our max window size is 6 hours, so we have set
 up a cron job to clean up shuffle files older than 12 hours otherwise it
 will eat up all our disk space.

 Please see the following. It seems the non-cleaning of shuffle files is
 being documented in 1.3.1.

 https://github.com/apache/spark/pull/5074/files
 https://issues.apache.org/jira/browse/SPARK-5836


 Also, for some reason, the following JIRAs that were reported as
 functional issues were closed as Duplicates of the above Documentation bug.
 Does this mean that this issue won't be tackled at all?

 https://issues.apache.org/jira/browse/SPARK-3563
 https://issues.apache.org/jira/browse/SPARK-4796
 https://issues.apache.org/jira/browse/SPARK-6011

 Any further insight into whether this is being looked into and meanwhile
 how to handle shuffle files will be greatly appreciated.

 Thanks
 NB




Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Jeetendra Gangele
Thanks for the reply.

Would using Phoenix inside Spark be useful?

What is the best way to bring data from HBase into Spark in terms of
application performance?
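
One commonly used option, sketched here with assumed table and row-key values
and worth benchmarking against per-row client calls: push a range Scan into
TableInputFormat and load the result as an RDD via newAPIHadoopRDD (jsc is the
JavaSparkContext used elsewhere in this thread).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;

Configuration hconf = HBaseConfiguration.create();
hconf.set(TableInputFormat.INPUT_TABLE, "company");  // assumed table name
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("row-000"));          // assumed range boundaries
scan.setStopRow(Bytes.toBytes("row-999"));
scan.setCaching(500);
scan.setCacheBlocks(false);
// serialize the Scan so the input format picks it up
hconf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

JavaPairRDD<ImmutableBytesWritable, Result> rows =
    jsc.newAPIHadoopRDD(hconf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);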

Regards
Jeetendra

On 20 April 2015 at 20:49, Ted Yu yuzhih...@gmail.com wrote:

 To my knowledge, Spark SQL currently doesn't provide range scan capability
 against hbase.

 Cheers



  On Apr 20, 2015, at 7:54 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:
 
  HI All,
 
  I am querying HBase, combining the results, and using them in my Spark job.
  I am querying HBase using the HBase client API inside my Spark job.
  Can anybody suggest whether Spark SQL will be fast enough and provide
 range queries?
 
  Regards
  Jeetendra
 



Re: Custom partitioner

2015-04-17 Thread Jeetendra Gangele
Hi Archit, thanks for the reply.
How can I do the custom compilation to reduce it to 4 bytes? I want to make
it 4 bytes in any case; can you please guide me?

I am applying flatMapValues in each step after zipWithIndex, so it should stay
on the same node, right? Why is it shuffling?
Also, I am currently running with very few records and it is still shuffling.

regards
jeetendra



On 17 April 2015 at 15:58, Archit Thakur archit279tha...@gmail.com wrote:

 I don't think you can change it to 4 bytes without custom compilation.
 To make the same key go to the same node, you'll have to repartition the data,
 which is shuffling anyway. Unless your raw data is such that the same key is
 already on the same node, you'll have to shuffle at least once to get the same
 key onto the same node.

 On Thu, Apr 16, 2015 at 10:16 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi All

 I have an RDD which has 1 million keys and each key is repeated across
 around 7000 values, so in total there will be around 1M*7K records in the RDD.

 Each key is created from zipWithIndex, so keys start from 0 to M-1. The
 problem with zipWithIndex is that it uses a Long for the key, which is 8 bytes.
 Can I reduce it to 4 bytes?

 Now, how can I make sure that records with the same key go to the same node so
 that I can avoid shuffling? Also, how will the default partitioner work here?

 Regards
 jeetendra





Re: Distinct is very slow

2015-04-17 Thread Jeetendra Gangele
I am saying to partition with something like partitionBy(new
HashPartitioner(16)); will this not work?
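
For reference, a small hedged sketch of that idea (RddForMarch being the
JavaPairRDD<Long, MatcherReleventData> from the code posted earlier in this
thread): hash-partition it once, cache it, and later stages that keep the same
partitioner avoid a full reshuffle.

// import org.apache.spark.HashPartitioner;
JavaPairRDD<Long, MatcherReleventData> partitioned =
    RddForMarch.partitionBy(new HashPartitioner(16)).cache();
System.out.println("num partitions: " + partitioned.partitions().size());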

On 17 April 2015 at 21:28, Jeetendra Gangele gangele...@gmail.com wrote:

 I have given 3000 tasks to mapToPair and now it is taking so much memory,
 shuffling and wasting time there. Here are the stats when I run with very
 small data; for almost all of the data it is shuffling. I'm not sure what is
 happening here. Any idea?


- *Total task time across all tasks: *11.0 h
- *Shuffle read: *153.8 MB
- *Shuffle write: *288.0 MB


 On 17 April 2015 at 14:32, Jeetendra Gangele gangele...@gmail.com wrote:

 mapToPair is running with 32 tasks but is very slow because of a lot of
 shuffle read (attaching a screenshot).
 Each task runs for 10 minutes, even though inside the function I am not
 doing anything costly.








Need Custom RDD

2015-04-17 Thread Jeetendra Gangele
Hi All

I have an RDD<Object>, then I convert it to RDD<Object, Long> with
zipWithIndex.
Here the index is a Long and it takes 8 bytes. Is there any way to make it an
Integer?
There is no API available with an int index.
How can I create a custom RDD so that it takes only 4 bytes for the index part?

Also, why is the API designed such that the index of an element is the second
part of the tuple?


Regards
j


Custom partitioner

2015-04-16 Thread Jeetendra Gangele
Hi All

I have an RDD which has 1 million keys and each key is repeated across around
7000 values, so in total there will be around 1M*7K records in the RDD.

Each key is created from zipWithIndex, so keys start from 0 to M-1. The problem
with zipWithIndex is that it uses a Long for the key, which is 8 bytes. Can I
reduce it to 4 bytes?

Now, how can I make sure that records with the same key go to the same node so
that I can avoid shuffling? Also, how will the default partitioner work here?

Regards
jeetendra


Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
Hi All, I have the below code where distinct is running for a long time.

blockingRdd is a JavaPairRDD<Long, String> and it will have 400K records:

JavaPairRDD<Long, Integer> completeDataToprocess = blockingRdd.flatMapValues(
    new Function<String, Iterable<Integer>>() {

  @Override
  public Iterable<Integer> call(String v1) throws Exception {
    return ckdao.getSingelkeyresult(v1);
  }
}).distinct(32);

I am running distinct on 800K records and it's taking 2 hours on 16 cores
and 20 GB RAM.


Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
I already checked, and GC is taking 1 sec for each task. Is this too much?
If yes, how do I avoid it?

On 16 April 2015 at 21:58, Akhil Das ak...@sigmoidanalytics.com wrote:

 Open the driver UI and see which stage is taking time; you can look at
 whether it's adding any GC time etc.

 Thanks
 Best Regards

 On Thu, Apr 16, 2015 at 9:56 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi All, I have the below code where distinct is running for a long time.

 blockingRdd is a JavaPairRDD<Long, String> and it will have 400K records:

 JavaPairRDD<Long, Integer> completeDataToprocess = blockingRdd.flatMapValues(
     new Function<String, Iterable<Integer>>() {

   @Override
   public Iterable<Integer> call(String v1) throws Exception {
     return ckdao.getSingelkeyresult(v1);
   }
 }).distinct(32);

 I am running distinct on 800K records and it's taking 2 hours on 16 cores
 and 20 GB RAM.





Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
No, I did not try the partitioning. Below is the full code:

public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd, JavaSparkContext jsc) throws IOException {
 long start = System.currentTimeMillis();
 JavaPairRDD<Long, MatcherReleventData> RddForMarch = matchRdd.zipWithIndex().mapToPair(
   new PairFunction<Tuple2<VendorRecord, Long>, Long, MatcherReleventData>() {

  @Override
  public Tuple2<Long, MatcherReleventData> call(Tuple2<VendorRecord, Long> t)
    throws Exception {
   MatcherReleventData matcherData = new MatcherReleventData();
   Tuple2<Long, MatcherReleventData> tuple = new Tuple2<Long, MatcherReleventData>(t._2,
     matcherData.convertVendorDataToMatcherData(t._1));
   return tuple;
  }

 }).cache();
 log.info("after index " + RddForMarch.take(1));
 Map<Long, MatcherReleventData> tmp = RddForMarch.collectAsMap();
 Map<Long, MatcherReleventData> matchData = new HashMap<Long, MatcherReleventData>(tmp);
 final Broadcast<Map<Long, MatcherReleventData>> dataMatchGlobal = jsc.broadcast(matchData);

 JavaPairRDD<Long, String> blockingRdd = RddForMarch.flatMapValues(new Function<MatcherReleventData, Iterable<String>>() {

  @Override
  public Iterable<String> call(MatcherReleventData v1)
    throws Exception {
   List<String> values = new ArrayList<String>();
   HelperUtilities helper1 = new HelperUtilities();
   MatcherKeys matchkeys = helper1.getBlockinkeys(v1);
   if (matchkeys.get_companyName() != null) {
    values.add(matchkeys.get_companyName());
   }
   if (matchkeys.get_phoneNumberr() != null) {
    values.add(matchkeys.get_phoneNumberr());
   }
   if (matchkeys.get_zipCode() != null) {
    values.add(matchkeys.get_zipCode());
   }
   if (matchkeys.getM_domain() != null) {
    values.add(matchkeys.getM_domain());
   }
   return values;
  }
 });
 log.info("blocking RDD is " + blockingRdd.count());
 int count = 0;
 log.info("Starting printing");
 for (Tuple2<Long, String> entry : blockingRdd.collect()) {
  log.info(entry._1() + " : " + entry._2());
  count++;
 }
 log.info("total count " + count);
 JavaPairRDD<Long, Integer> completeDataToprocess = blockingRdd.flatMapValues(new Function<String, Iterable<Integer>>() {

  @Override
  public Iterable<Integer> call(String v1) throws Exception {
   return ckdao.getSingelkeyresult(v1);
  }
 }).distinct(32);
 log.info("after hbase count is " + completeDataToprocess.count());
 log.info("data for process " + completeDataToprocess.take(1));
 JavaPairRDD<Long, Tuple2<Integer, Double>> withScore = completeDataToprocess.mapToPair(
   new PairFunction<Tuple2<Long, Integer>, Long, Tuple2<Integer, Double>>() {

  @Override
  public Tuple2<Long, Tuple2<Integer, Double>> call(Tuple2<Long, Integer> t)
    throws Exception {
   Scoring scoreObj = new Scoring();
   double score = scoreObj.computeMatchScore(companyDAO.get(t._2()),
     dataMatchGlobal.getValue().get(t._1()));
   Tuple2<Integer, Double> maptuple = new Tuple2<Integer, Double>(t._2(), score);
   Tuple2<Long, Tuple2<Integer, Double>> tuple = new Tuple2<Long, Tuple2<Integer, Double>>(t._1(), maptuple);
   return tuple;
  }
 });
 log.info("with score tuple is " + withScore.take(1));
 JavaPairRDD<Long, Tuple2<Integer, Double>> maxScoreRDD = withScore.reduceByKey(
   new Function2<Tuple2<Integer, Double>, Tuple2<Integer, Double>, Tuple2<Integer, Double>>() {

  @Override
  public Tuple2<Integer, Double> call(Tuple2<Integer, Double> v1,
    Tuple2<Integer, Double> v2) throws Exception {
   int res = v1._2().compareTo(v2._2());
   if (res > 0) {
    Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v1._1(), v1._2());
    return result;
   } else if (res < 0) {
    Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v2._1(), v2._2());
    return result;
   } else {
    Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v2._1(), v2._2());
    return result;
   }
  }
 });
 log.info("max score RDD " + maxScoreRDD.take(10));

 maxScoreRDD.foreach(new VoidFunction<Tuple2<Long, Tuple2<Integer, Double>>>() {

  @Override
  public void call(Tuple2<Long, Tuple2<Integer, Double>> t)
    throws Exception {
   MatcherReleventData matchedData = dataMatchGlobal.getValue().get(t._1());
   log.info("broadcast is " + dataMatchGlobal.getValue().get(t._1()));
   // Set the score for better understanding of merge
   matchedData.setScore(t._2()._2());
   vdDoa.updateMatchedRecordWithScore(matchedData, t._2()._1(), Souce_id);
  }
 });
 log.info("took " + (System.currentTimeMillis() - start) + " mills to run matcher");
}


On 16 April 2015 at 22:25, Akhil Das ak...@sigmoidanalytics.com wrote:

 Can you paste your complete code? Did you try repartitioning / increasing the
 level of parallelism to speed up the processing? Since you have 16 cores,
 I'm assuming your 400k records isn't bigger than a 10G dataset.

 Thanks
 Best Regards

 On Thu, Apr 16, 2015 at 10:00 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 I already checked and G is taking 1 secs for each task. is this too much?
 if yes how to avoid this?


 On 16 April 2015 at 21:58, Akhil Das ak...@sigmoidanalytics.com wrote:

 Open the driver ui and see which stage is taking time, you can look
 whether its adding any GC time etc.

 Thanks
 Best Regards

 On Thu, Apr 16, 2015 at 9:56 PM, Jeetendra Gangele gangele...@gmail.com
  wrote:

 Hi All I have below code whether distinct is running for more time.

 blockingRdd is the combination

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
Akhil, any thought on this?

On 16 April 2015 at 23:07, Jeetendra Gangele gangele...@gmail.com wrote:

 No, I did not try the partitioning. Below is the full code:

 public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd, JavaSparkContext jsc) throws IOException {
  long start = System.currentTimeMillis();
  JavaPairRDD<Long, MatcherReleventData> RddForMarch = matchRdd.zipWithIndex().mapToPair(
    new PairFunction<Tuple2<VendorRecord, Long>, Long, MatcherReleventData>() {

   @Override
   public Tuple2<Long, MatcherReleventData> call(Tuple2<VendorRecord, Long> t)
     throws Exception {
    MatcherReleventData matcherData = new MatcherReleventData();
    Tuple2<Long, MatcherReleventData> tuple = new Tuple2<Long, MatcherReleventData>(t._2,
      matcherData.convertVendorDataToMatcherData(t._1));
    return tuple;
   }

  }).cache();
  log.info("after index " + RddForMarch.take(1));
  Map<Long, MatcherReleventData> tmp = RddForMarch.collectAsMap();
  Map<Long, MatcherReleventData> matchData = new HashMap<Long, MatcherReleventData>(tmp);
  final Broadcast<Map<Long, MatcherReleventData>> dataMatchGlobal = jsc.broadcast(matchData);

  JavaPairRDD<Long, String> blockingRdd = RddForMarch.flatMapValues(new Function<MatcherReleventData, Iterable<String>>() {

   @Override
   public Iterable<String> call(MatcherReleventData v1)
     throws Exception {
    List<String> values = new ArrayList<String>();
    HelperUtilities helper1 = new HelperUtilities();
    MatcherKeys matchkeys = helper1.getBlockinkeys(v1);
    if (matchkeys.get_companyName() != null) {
     values.add(matchkeys.get_companyName());
    }
    if (matchkeys.get_phoneNumberr() != null) {
     values.add(matchkeys.get_phoneNumberr());
    }
    if (matchkeys.get_zipCode() != null) {
     values.add(matchkeys.get_zipCode());
    }
    if (matchkeys.getM_domain() != null) {
     values.add(matchkeys.getM_domain());
    }
    return values;
   }
  });
  log.info("blocking RDD is " + blockingRdd.count());
  int count = 0;
  log.info("Starting printing");
  for (Tuple2<Long, String> entry : blockingRdd.collect()) {
   log.info(entry._1() + " : " + entry._2());
   count++;
  }
  log.info("total count " + count);
  JavaPairRDD<Long, Integer> completeDataToprocess = blockingRdd.flatMapValues(new Function<String, Iterable<Integer>>() {

   @Override
   public Iterable<Integer> call(String v1) throws Exception {
    return ckdao.getSingelkeyresult(v1);
   }
  }).distinct(32);
  log.info("after hbase count is " + completeDataToprocess.count());
  log.info("data for process " + completeDataToprocess.take(1));
  JavaPairRDD<Long, Tuple2<Integer, Double>> withScore = completeDataToprocess.mapToPair(
    new PairFunction<Tuple2<Long, Integer>, Long, Tuple2<Integer, Double>>() {

   @Override
   public Tuple2<Long, Tuple2<Integer, Double>> call(Tuple2<Long, Integer> t)
     throws Exception {
    Scoring scoreObj = new Scoring();
    double score = scoreObj.computeMatchScore(companyDAO.get(t._2()),
      dataMatchGlobal.getValue().get(t._1()));
    Tuple2<Integer, Double> maptuple = new Tuple2<Integer, Double>(t._2(), score);
    Tuple2<Long, Tuple2<Integer, Double>> tuple = new Tuple2<Long, Tuple2<Integer, Double>>(t._1(), maptuple);
    return tuple;
   }
  });
  log.info("with score tuple is " + withScore.take(1));
  JavaPairRDD<Long, Tuple2<Integer, Double>> maxScoreRDD = withScore.reduceByKey(
    new Function2<Tuple2<Integer, Double>, Tuple2<Integer, Double>, Tuple2<Integer, Double>>() {

   @Override
   public Tuple2<Integer, Double> call(Tuple2<Integer, Double> v1,
     Tuple2<Integer, Double> v2) throws Exception {
    int res = v1._2().compareTo(v2._2());
    if (res > 0) {
     Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v1._1(), v1._2());
     return result;
    } else if (res < 0) {
     Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v2._1(), v2._2());
     return result;
    } else {
     Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v2._1(), v2._2());
     return result;
    }
   }
  });
  log.info("max score RDD " + maxScoreRDD.take(10));

  maxScoreRDD.foreach(new VoidFunction<Tuple2<Long, Tuple2<Integer, Double>>>() {

   @Override
   public void call(Tuple2<Long, Tuple2<Integer, Double>> t)
     throws Exception {
    MatcherReleventData matchedData = dataMatchGlobal.getValue().get(t._1());
    log.info("broadcast is " + dataMatchGlobal.getValue().get(t._1()));
    // Set the score for better understanding of merge
    matchedData.setScore(t._2()._2());
    vdDoa.updateMatchedRecordWithScore(matchedData, t._2()._1(), Souce_id);
   }
  });
  log.info("took " + (System.currentTimeMillis() - start) + " mills to run matcher");
 }


 On 16 April 2015 at 22:25, Akhil Das ak...@sigmoidanalytics.com wrote:

 Can you paste your complete code? Did you try repartioning/increasing
 level of parallelism to speed up the processing. Since you have 16 cores,
 and I'm assuming your 400k records isn't bigger than a 10G dataset.

 Thanks
 Best Regards

 On Thu, Apr 16, 2015 at 10:00 PM, Jeetendra Gangele gangele...@gmail.com
  wrote:

 I already checked and G is taking 1 secs for each task. is this too
 much? if yes how to avoid this?


 On 16 April 2015 at 21:58, Akhil Das ak...@sigmoidanalytics.com wrote:

 Open the driver ui and see which stage is taking time, you can look
 whether its adding

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Jeetendra Gangele
Does this same functionality exist with Java?
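
It does; a hedged Java sketch of Sean's suggestion follows, where Document,
getId() and wantedIds are stand-ins for the poster's own classes and data.

// assumes org.apache.spark.broadcast.Broadcast, org.apache.spark.api.java.function.Function,
// java.util.HashSet and java.util.Set
final Broadcast<Set<String>> idsBc = jsc.broadcast(new HashSet<String>(wantedIds));
JavaRDD<Document> matched = allDocs.filter(new Function<Document, Boolean>() {
  @Override
  public Boolean call(Document d) throws Exception {
    return idsBc.value().contains(d.getId());
  }
});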

On 17 April 2015 at 02:23, Evo Eftimov evo.efti...@isecc.com wrote:

 You can use

 def  partitionBy(partitioner: Partitioner): RDD[(K, V)]
 Return a copy of the RDD partitioned using the specified partitioner

 The https://github.com/amplab/spark-indexedrdd stuff looks pretty cool
 and is something which adds valuable functionality to spark e.g. the point
 lookups PROVIDED it can be executed from within function running on worker
 executors

 Can somebody from DataBricks sched more light here

 -Original Message-
 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
 Sent: Thursday, April 16, 2015 9:39 PM
 To: user@spark.apache.org
 Subject: RE: How to join RDD keyValuePairs efficiently

 Evo

  partition the large doc RDD based on the hash function on the
 key ie the docid

 What API to use to do this?

 By the way, loading the entire dataset to memory cause OutOfMemory problem
 because it is too large (I only have one machine with 16GB and 4 cores).

 I found something called IndexedRDD on the web
 https://github.com/amplab/spark-indexedrdd

 Has anybody use it?

 Ningjun

 -Original Message-
 From: Evo Eftimov [mailto:evo.efti...@isecc.com]
 Sent: Thursday, April 16, 2015 12:18 PM
 To: 'Sean Owen'; Wang, Ningjun (LNG-NPV)
 Cc: user@spark.apache.org
 Subject: RE: How to join RDD keyValuePairs efficiently

 Ningjun, to speed up your current design you can do the following:

 1.partition the large doc RDD based on the hash function on the key ie the
 docid

 2. persist the large dataset in memory to be available for subsequent
 queries without reloading and repartitioning for every search query

 3. partition the small doc dataset in the same way - this will result in
 collocated small and large RDD partitions with the same key

 4. run the join - the match is not going to be sequential it is based on
 hash of the key moreover RDD elements with the same key will be collocated
 on the same cluster node


 OR simply go for Sean suggestion - under the hood it works in a slightly
 different way - the filter is executed in mappers running in parallel on
 every node and also by passing the small doc IDs to each filter (mapper)
 you essentially replicate them on every node so each mapper instance has
 its own copy and runs with it when filtering

 And finally you can prototype both options described above and measure and
 compare their performance

 -Original Message-
 From: Sean Owen [mailto:so...@cloudera.com]
 Sent: Thursday, April 16, 2015 5:02 PM
 To: Wang, Ningjun (LNG-NPV)
 Cc: user@spark.apache.org
 Subject: Re: How to join RDD keyValuePairs efficiently

 This would be much, much faster if your set of IDs was simply a Set, and
 you passed that to a filter() call that just filtered in the docs that
 matched an ID in the set.

 On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV) 
 ningjun.w...@lexisnexis.com wrote:
  Does anybody have a solution for this?
 
 
 
 
 
  From: Wang, Ningjun (LNG-NPV)
  Sent: Tuesday, April 14, 2015 10:41 AM
  To: user@spark.apache.org
  Subject: How to join RDD keyValuePairs efficiently
 
 
 
  I have an RDD that contains millions of Document objects. Each
  document has an unique Id that is a string. I need to find the documents
 by ids quickly.
  Currently I used RDD join as follow
 
 
 
  First I save the RDD as object file
 
 
 
  allDocs : RDD[Document] = getDocs()  // this RDD contains 7 million
  Document objects
 
  allDocs.saveAsObjectFile("/temp/allDocs.obj")
 
 
 
  Then I wrote a function to find documents by Ids
 
 
 
  def findDocumentsByIds(docids: RDD[String]) = {
 
  // docids contains less than 100 item
 
  val allDocs : RDD[Document] = sc.objectFile[Document]("/temp/allDocs.obj")
 
  val idAndDocs = allDocs.keyBy(d => d.id)
 
  docids.map(id => (id, id)).join(idAndDocs).map(t => t._2._2)
 
  }
 
 
 
  I found that this is very slow. I suspect it scans the entire 7 million
  Document objects in /temp/allDocs.obj sequentially to find the
  desired document.
 
 
 
  Is there any efficient way to do this?
 
 
 
  One option I am thinking is that instead of storing the RDD[Document]
  as object file, I store each document in a separate file with filename
  equal to the docid. This way I can find a document quickly by docid.
  However this means I need to save the RDD to 7 million small file
  which will take a very long time to save and may cause IO problems with
 so many small files.
 
 
 
  Is there any other way?
 
 
 
 
 
 
 
  Ningjun

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org







Re: regarding ZipWithIndex

2015-04-16 Thread Jeetendra Gangele
Can you please guide me on how I can extend RDD and convert it in the way
you are suggesting?

On 16 April 2015 at 23:46, Jeetendra Gangele gangele...@gmail.com wrote:

 I type T i already have Object ... I have RDDObject and then I am
 calling ZipWithIndex on this RDD and getting RDDObject,Long on this I am
 running MapToPair and converting into RDDLong,Object so that i can use it
 later for other operation like lookup and join.


 On 16 April 2015 at 23:42, Ted Yu yuzhih...@gmail.com wrote:

 The Long in RDD[(T, Long)] is type parameter. You can create RDD with
 Integer as the first type parameter.

 Cheers

 On Thu, Apr 16, 2015 at 11:07 AM, Jeetendra Gangele gangele...@gmail.com
  wrote:

 Hi Ted.
 This works for me. But since Long takes here 8 bytes. Can I reduce it to
 4 bytes. its just a index and I feel 4 bytes was more than enough.is
 there any method which takes Integer or similar for Index?


 On 13 April 2015 at 01:59, Ted Yu yuzhih...@gmail.com wrote:

 bq. will return something like JavaPairRDDObject, long

 The long component of the pair fits your description of index. What
 other requirement does ZipWithIndex not provide you ?

 Cheers

 On Sun, Apr 12, 2015 at 1:16 PM, Jeetendra Gangele 
 gangele...@gmail.com wrote:

 Hi All I have an RDD JavaRDDObject and I want to convert it to
 JavaPairRDDIndex,Object.. Index should be unique and it should maintain
 the order. For first object It should have 1 and then for second 2 like
 that.

 I tried using ZipWithIndex but it will return something like
 JavaPairRDDObject, long
 I wanted to use this RDD for lookup and join operation later in my
 workflow so ordering is important.


 Regards
 jeet












Re: regarding ZipWithIndex

2015-04-16 Thread Jeetendra Gangele
Hi Ted.
This works for me. But Long takes 8 bytes here. Can I reduce it to 4
bytes? It's just an index and I feel 4 bytes is more than enough. Is there
any method which takes Integer or similar for the index?


On 13 April 2015 at 01:59, Ted Yu yuzhih...@gmail.com wrote:

 bq. will return something like JavaPairRDD<Object, Long>

 The long component of the pair fits your description of index. What other
 requirement does ZipWithIndex not provide you ?

 Cheers

 On Sun, Apr 12, 2015 at 1:16 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi All I have an RDD JavaRDDObject and I want to convert it to
 JavaPairRDDIndex,Object.. Index should be unique and it should maintain
 the order. For first object It should have 1 and then for second 2 like
 that.

 I tried using ZipWithIndex but it will return something like
 JavaPairRDDObject, long
 I wanted to use this RDD for lookup and join operation later in my
 workflow so ordering is important.


 Regards
 jeet





Re: regarding ZipWithIndex

2015-04-16 Thread Jeetendra Gangele
For type T I already have Object: I have an RDD<Object> and then I am calling
zipWithIndex on this RDD and getting an RDD<Object, Long>; on this I am running
mapToPair and converting it into an RDD<Long, Object> so that I can use it later
for other operations like lookup and join.

On 16 April 2015 at 23:42, Ted Yu yuzhih...@gmail.com wrote:

 The Long in RDD[(T, Long)] is type parameter. You can create RDD with
 Integer as the first type parameter.

 Cheers

 On Thu, Apr 16, 2015 at 11:07 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi Ted.
 This works for me. But since Long takes here 8 bytes. Can I reduce it to
 4 bytes. its just a index and I feel 4 bytes was more than enough.is
 there any method which takes Integer or similar for Index?


 On 13 April 2015 at 01:59, Ted Yu yuzhih...@gmail.com wrote:

 bq. will return something like JavaPairRDDObject, long

 The long component of the pair fits your description of index. What
 other requirement does ZipWithIndex not provide you ?

 Cheers

 On Sun, Apr 12, 2015 at 1:16 PM, Jeetendra Gangele gangele...@gmail.com
  wrote:

 Hi All I have an RDD JavaRDDObject and I want to convert it to
 JavaPairRDDIndex,Object.. Index should be unique and it should maintain
 the order. For first object It should have 1 and then for second 2 like
 that.

 I tried using ZipWithIndex but it will return something like
 JavaPairRDDObject, long
 I wanted to use this RDD for lookup and join operation later in my
 workflow so ordering is important.


 Regards
 jeet









Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
At the distinct level I will have 7000 times more elements in my RDD. So
should I repartition? Its parent will definitely have fewer partitions. How
can I see the number of partitions through Java code?


On 16 April 2015 at 23:07, Jeetendra Gangele gangele...@gmail.com wrote:

 No, I did not try the partitioning. Below is the full code:

 public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd, JavaSparkContext jsc) throws IOException {
  long start = System.currentTimeMillis();
  JavaPairRDD<Long, MatcherReleventData> RddForMarch = matchRdd.zipWithIndex().mapToPair(
    new PairFunction<Tuple2<VendorRecord, Long>, Long, MatcherReleventData>() {

   @Override
   public Tuple2<Long, MatcherReleventData> call(Tuple2<VendorRecord, Long> t)
     throws Exception {
    MatcherReleventData matcherData = new MatcherReleventData();
    Tuple2<Long, MatcherReleventData> tuple = new Tuple2<Long, MatcherReleventData>(t._2,
      matcherData.convertVendorDataToMatcherData(t._1));
    return tuple;
   }

  }).cache();
  log.info("after index " + RddForMarch.take(1));
  Map<Long, MatcherReleventData> tmp = RddForMarch.collectAsMap();
  Map<Long, MatcherReleventData> matchData = new HashMap<Long, MatcherReleventData>(tmp);
  final Broadcast<Map<Long, MatcherReleventData>> dataMatchGlobal = jsc.broadcast(matchData);

  JavaPairRDD<Long, String> blockingRdd = RddForMarch.flatMapValues(new Function<MatcherReleventData, Iterable<String>>() {

   @Override
   public Iterable<String> call(MatcherReleventData v1)
     throws Exception {
    List<String> values = new ArrayList<String>();
    HelperUtilities helper1 = new HelperUtilities();
    MatcherKeys matchkeys = helper1.getBlockinkeys(v1);
    if (matchkeys.get_companyName() != null) {
     values.add(matchkeys.get_companyName());
    }
    if (matchkeys.get_phoneNumberr() != null) {
     values.add(matchkeys.get_phoneNumberr());
    }
    if (matchkeys.get_zipCode() != null) {
     values.add(matchkeys.get_zipCode());
    }
    if (matchkeys.getM_domain() != null) {
     values.add(matchkeys.getM_domain());
    }
    return values;
   }
  });
  log.info("blocking RDD is " + blockingRdd.count());
  int count = 0;
  log.info("Starting printing");
  for (Tuple2<Long, String> entry : blockingRdd.collect()) {
   log.info(entry._1() + " : " + entry._2());
   count++;
  }
  log.info("total count " + count);
  JavaPairRDD<Long, Integer> completeDataToprocess = blockingRdd.flatMapValues(new Function<String, Iterable<Integer>>() {

   @Override
   public Iterable<Integer> call(String v1) throws Exception {
    return ckdao.getSingelkeyresult(v1);
   }
  }).distinct(32);
  log.info("after hbase count is " + completeDataToprocess.count());
  log.info("data for process " + completeDataToprocess.take(1));
  JavaPairRDD<Long, Tuple2<Integer, Double>> withScore = completeDataToprocess.mapToPair(
    new PairFunction<Tuple2<Long, Integer>, Long, Tuple2<Integer, Double>>() {

   @Override
   public Tuple2<Long, Tuple2<Integer, Double>> call(Tuple2<Long, Integer> t)
     throws Exception {
    Scoring scoreObj = new Scoring();
    double score = scoreObj.computeMatchScore(companyDAO.get(t._2()),
      dataMatchGlobal.getValue().get(t._1()));
    Tuple2<Integer, Double> maptuple = new Tuple2<Integer, Double>(t._2(), score);
    Tuple2<Long, Tuple2<Integer, Double>> tuple = new Tuple2<Long, Tuple2<Integer, Double>>(t._1(), maptuple);
    return tuple;
   }
  });
  log.info("with score tuple is " + withScore.take(1));
  JavaPairRDD<Long, Tuple2<Integer, Double>> maxScoreRDD = withScore.reduceByKey(
    new Function2<Tuple2<Integer, Double>, Tuple2<Integer, Double>, Tuple2<Integer, Double>>() {

   @Override
   public Tuple2<Integer, Double> call(Tuple2<Integer, Double> v1,
     Tuple2<Integer, Double> v2) throws Exception {
    int res = v1._2().compareTo(v2._2());
    if (res > 0) {
     Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v1._1(), v1._2());
     return result;
    } else if (res < 0) {
     Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v2._1(), v2._2());
     return result;
    } else {
     Tuple2<Integer, Double> result = new Tuple2<Integer, Double>(v2._1(), v2._2());
     return result;
    }
   }
  });
  log.info("max score RDD " + maxScoreRDD.take(10));

  maxScoreRDD.foreach(new VoidFunction<Tuple2<Long, Tuple2<Integer, Double>>>() {

   @Override
   public void call(Tuple2<Long, Tuple2<Integer, Double>> t)
     throws Exception {
    MatcherReleventData matchedData = dataMatchGlobal.getValue().get(t._1());
    log.info("broadcast is " + dataMatchGlobal.getValue().get(t._1()));
    // Set the score for better understanding of merge
    matchedData.setScore(t._2()._2());
    vdDoa.updateMatchedRecordWithScore(matchedData, t._2()._1(), Souce_id);
   }
  });
  log.info("took " + (System.currentTimeMillis() - start) + " mills to run matcher");
 }


 On 16 April 2015 at 22:25, Akhil Das ak...@sigmoidanalytics.com wrote:

 Can you paste your complete code? Did you try repartioning/increasing
 level of parallelism to speed up the processing. Since you have 16 cores,
 and I'm assuming your 400k records isn't bigger than a 10G dataset.

 Thanks
 Best Regards

 On Thu, Apr 16, 2015 at 10:00 PM, Jeetendra Gangele gangele...@gmail.com
  wrote:

 I already checked and G is taking 1 secs for each task. is this too
 much? if yes how to avoid

Exception while using kryo with broadcast

2015-04-15 Thread Jeetendra Gangele
Hi All, I am getting the below exception while using Kryo serialization with a
broadcast variable. I am broadcasting a HashMap with the lines below.

Map<Long, MatcherReleventData> matchData = RddForMarch.collectAsMap();
final Broadcast<Map<Long, MatcherReleventData>> dataMatchGlobal =
jsc.broadcast(matchData);






15/04/15 12:58:51 ERROR executor.Executor: Exception in task 0.3 in stage
4.0 (TID 7)
java.io.IOException: java.lang.UnsupportedOperationException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1003)
at
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at com.insideview.yam.CompanyMatcher$4.call(CompanyMatcher.java:103)
at com.insideview.yam.CompanyMatcher$4.call(CompanyMatcher.java:1)
at
org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
at
org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:204)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException
at java.util.AbstractMap.put(AbstractMap.java:203)
at
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
at
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:142)
at
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1000)
... 18 more
15/04/15 12:58:51 INFO executor.CoarseGrainedExecutorBackend: Driver
commanded a shutdown
15/04/15 12:58:51 INFO storage.MemoryStore: MemoryStore cleared
15/04/15 12:58:51 INFO storage.BlockManager: BlockManager stopped
15/04/15 12:58:51 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Shutting down remote daemon.


Re: Exception while using Kryo with broadcast

2015-04-15 Thread Jeetendra Gangele
Yes, without Kryo it did work out; when I removed the Kryo registration it
worked.

On 15 April 2015 at 19:24, Jeetendra Gangele gangele...@gmail.com wrote:

 its not working with the combination of Broadcast.
 Without Kyro also not working.


 On 15 April 2015 at 19:20, Akhil Das ak...@sigmoidanalytics.com wrote:

 Is it working without kryo?

 Thanks
 Best Regards

 On Wed, Apr 15, 2015 at 6:38 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi All I am getting below exception while using Kyro serializable with
 broadcast variable. I am broadcating a hasmap with below line.

 MapLong, MatcherReleventData matchData =RddForMarch.collectAsMap();
 final BroadcastMapLong, MatcherReleventData dataMatchGlobal =
 jsc.broadcast(matchData);






 15/04/15 12:58:51 ERROR executor.Executor: Exception in task 0.3 in
 stage 4.0 (TID 7)
 java.io.IOException: java.lang.UnsupportedOperationException
 at
 org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1003)
 at
 org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
 at
 com.insideview.yam.CompanyMatcher$4.call(CompanyMatcher.java:103)
 at
 com.insideview.yam.CompanyMatcher$4.call(CompanyMatcher.java:1)
 at
 org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
 at
 org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:204)
 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.UnsupportedOperationException
 at java.util.AbstractMap.put(AbstractMap.java:203)
 at
 com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
 at
 com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:142)
 at
 org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
 at
 org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
 at
 org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1000)
 ... 18 more
 15/04/15 12:58:51 INFO executor.CoarseGrainedExecutorBackend: Driver
 commanded a shutdown
 15/04/15 12:58:51 INFO storage.MemoryStore: MemoryStore cleared
 15/04/15 12:58:51 INFO storage.BlockManager: BlockManager stopped
 15/04/15 12:58:51 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Shutting down remote daemon.











Re: Exception while using Kryo with broadcast

2015-04-15 Thread Jeetendra Gangele
This looks like a known issue? Check this out:
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-InvalidClassException-org-apache-spark-api-java-JavaUtils-SerializableMapWrapper-no-valid-cor-td20034.html

Can you please suggest any workaround? I am broadcasting the HashMap returned
from RDD.collectAsMap().

On 15 April 2015 at 19:33, Imran Rashid iras...@cloudera.com wrote:

 this is a really strange exception ... I'm especially surprised that it
 doesn't work w/ java serialization.  Do you think you could try to boil it
 down to a minimal example?

 On Wed, Apr 15, 2015 at 8:58 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Yes Without Kryo it did work out.when I remove kryo registration it did
 worked out

 On 15 April 2015 at 19:24, Jeetendra Gangele gangele...@gmail.com
 wrote:

 its not working with the combination of Broadcast.
 Without Kyro also not working.


 On 15 April 2015 at 19:20, Akhil Das ak...@sigmoidanalytics.com wrote:

 Is it working without kryo?

 Thanks
 Best Regards

 On Wed, Apr 15, 2015 at 6:38 PM, Jeetendra Gangele 
 gangele...@gmail.com wrote:

 Hi All I am getting below exception while using Kyro serializable with
 broadcast variable. I am broadcating a hasmap with below line.

 MapLong, MatcherReleventData matchData =RddForMarch.collectAsMap();
 final BroadcastMapLong, MatcherReleventData dataMatchGlobal =
 jsc.broadcast(matchData);






 15/04/15 12:58:51 ERROR executor.Executor: Exception in task 0.3 in
 stage 4.0 (TID 7)
 java.io.IOException: java.lang.UnsupportedOperationException
 at
 org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1003)
 at
 org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
 at
 com.insideview.yam.CompanyMatcher$4.call(CompanyMatcher.java:103)
 at
 com.insideview.yam.CompanyMatcher$4.call(CompanyMatcher.java:1)
 at
 org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
 at
 org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:204)
 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.UnsupportedOperationException
 at java.util.AbstractMap.put(AbstractMap.java:203)
 at
 com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
 at
 com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:142)
 at
 org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
 at
 org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
 at
 org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1000)
 ... 18 more
 15/04/15 12:58:51 INFO executor.CoarseGrainedExecutorBackend: Driver
 commanded a shutdown
 15/04/15 12:58:51 INFO storage.MemoryStore: MemoryStore cleared
 15/04/15 12:58:51 INFO storage.BlockManager: BlockManager stopped
 15/04/15 12:58:51 INFO
 remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote
 daemon.
















Re: Exception while using Kryo with broadcast

2015-04-15 Thread Jeetendra Gangele
It's not working with the combination of broadcast.
Without Kryo it's also not working.

On 15 April 2015 at 19:20, Akhil Das ak...@sigmoidanalytics.com wrote:

 Is it working without kryo?

 Thanks
 Best Regards

 On Wed, Apr 15, 2015 at 6:38 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi All I am getting below exception while using Kyro serializable with
 broadcast variable. I am broadcating a hasmap with below line.

 MapLong, MatcherReleventData matchData =RddForMarch.collectAsMap();
 final BroadcastMapLong, MatcherReleventData dataMatchGlobal =
 jsc.broadcast(matchData);






 15/04/15 12:58:51 ERROR executor.Executor: Exception in task 0.3 in stage
 4.0 (TID 7)
 java.io.IOException: java.lang.UnsupportedOperationException
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1003)
 at
 org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
 at
 com.insideview.yam.CompanyMatcher$4.call(CompanyMatcher.java:103)
 at com.insideview.yam.CompanyMatcher$4.call(CompanyMatcher.java:1)
 at
 org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
 at
 org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:204)
 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.UnsupportedOperationException
 at java.util.AbstractMap.put(AbstractMap.java:203)
 at
 com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
 at
 com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:142)
 at
 org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
 at
 org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1000)
 ... 18 more
 15/04/15 12:58:51 INFO executor.CoarseGrainedExecutorBackend: Driver
 commanded a shutdown
 15/04/15 12:58:51 INFO storage.MemoryStore: MemoryStore cleared
 15/04/15 12:58:51 INFO storage.BlockManager: BlockManager stopped
 15/04/15 12:58:51 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Shutting down remote daemon.







Re: Exception while using Kryo with broadcast

2015-04-15 Thread Jeetendra Gangele
This worked with Java serialization. I am using 1.2.0; you are right that if I
use 1.2.1 or 1.3.0 this issue will not occur.
I will test this and let you know.

On 15 April 2015 at 19:48, Imran Rashid iras...@cloudera.com wrote:

 oh interesting.  The suggested workaround is to wrap the result from
 collectAsMap into another hashmap, you should try that:

 Map<Long, MatcherReleventData> matchData = RddForMarch.collectAsMap();
 Map<Long, MatcherReleventData> tmp = new HashMap<Long, MatcherReleventData>(matchData);
 final Broadcast<Map<Long, MatcherReleventData>> dataMatchGlobal =
 jsc.broadcast(tmp);

 Can you please clarify:
 * Does it work w/ java serialization in the end?  Or is this kryo only?
 * which Spark version you are using? (one of the relevant bugs was fixed
 in 1.2.1 and 1.3.0)
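For reference, a minimal sketch of how the Kryo side is usually configured on
the driver to go with the HashMap copy above (registerKryoClasses is available
from Spark 1.2 onwards; the class list is only an example):

SparkConf sparkConf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // example registration; add the other classes that travel inside broadcasts/shuffles
    .registerKryoClasses(new Class<?>[] { MatcherReleventData.class, HashMap.class });
JavaSparkContext jsc = new JavaSparkContext(sparkConf);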



 On Wed, Apr 15, 2015 at 9:06 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 This looks like known issue? check this out

 http://apache-spark-user-list.1001560.n3.nabble.com/java-io-InvalidClassException-org-apache-spark-api-java-JavaUtils-SerializableMapWrapper-no-valid-cor-td20034.html

 Can you please suggest any work around I am broad casting HashMap return
 from RDD.collectasMap().

 On 15 April 2015 at 19:33, Imran Rashid iras...@cloudera.com wrote:

 this is a really strange exception ... I'm especially surprised that it
 doesn't work w/ java serialization.  Do you think you could try to boil it
 down to a minimal example?

 On Wed, Apr 15, 2015 at 8:58 AM, Jeetendra Gangele gangele...@gmail.com
  wrote:

 Yes Without Kryo it did work out.when I remove kryo registration it did
 worked out

 On 15 April 2015 at 19:24, Jeetendra Gangele gangele...@gmail.com
 wrote:

 its not working with the combination of Broadcast.
 Without Kyro also not working.


 On 15 April 2015 at 19:20, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Is it working without kryo?

 Thanks
 Best Regards

 On Wed, Apr 15, 2015 at 6:38 PM, Jeetendra Gangele 
 gangele...@gmail.com wrote:

 Hi All I am getting below exception while using Kyro serializable
 with broadcast variable. I am broadcating a hasmap with below line.

 MapLong, MatcherReleventData matchData =RddForMarch.collectAsMap();
 final BroadcastMapLong, MatcherReleventData dataMatchGlobal =
 jsc.broadcast(matchData);






 15/04/15 12:58:51 ERROR executor.Executor: Exception in task 0.3 in
 stage 4.0 (TID 7)
 java.io.IOException: java.lang.UnsupportedOperationException
 at
 org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1003)
 at
 org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
 at
 com.insideview.yam.CompanyMatcher$4.call(CompanyMatcher.java:103)
 at
 com.insideview.yam.CompanyMatcher$4.call(CompanyMatcher.java:1)
 at
 org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
 at
 org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
 at
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:204)
 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.UnsupportedOperationException
 at java.util.AbstractMap.put(AbstractMap.java:203)
 at
 com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
 at
 com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:142)
 at
 org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
 at
 org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
 at
 org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1000)
 ... 18 more
 15/04/15 12:58:51 INFO executor.CoarseGrainedExecutorBackend: Driver
 commanded a shutdown
 15/04/15 12:58:51 INFO

exception during foreach run

2015-04-15 Thread Jeetendra Gangele
Hi All

I am getting the below exception while running foreach after zipWithIndex and
flatMapValues.
Inside the foreach I am doing a lookup in a broadcast variable.


java.util.concurrent.RejectedExecutionException: Worker has already been
shutdown
at
org.jboss.netty.channel.socket.nio.AbstractNioSelector.registerTask(AbstractNioSelector.java:120)
at
org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:72)
at
org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
at
org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:56)
at
org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
at
org.jboss.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34)
at
org.jboss.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:496)
at
org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:46)
at
org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:54)
at org.jboss.netty.channel.Channels.disconnect(Channels.java:781)
at
org.jboss.netty.channel.AbstractChannel.disconnect(AbstractChannel.java:211)
at
akka.remote.transport.netty.NettyTransport$$anonfun$gracefulClose$1.apply(NettyTransport.scala:223)
at
akka.remote.transport.netty.NettyTransport$$anonfun$gracefulClose$1.apply(NettyTransport.scala:222)
at scala.util.Success.foreach(Try.scala:205)
at
scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:204)
at
scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:204)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at
akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
:


Re: regarding ZipWithIndex

2015-04-13 Thread Jeetendra Gangele
How about using mapToPair and exchanging the two? Will it be efficient?
Below is the code; will it be efficient to convert like this?


JavaPairRDD<Long, MatcherReleventData> RddForMarch
= matchRdd.zipWithIndex().mapToPair(new
PairFunction<Tuple2<VendorRecord, Long>, Long, MatcherReleventData>() {

@Override
public Tuple2<Long, MatcherReleventData> call(Tuple2<VendorRecord, Long> t)
throws Exception {
MatcherReleventData matcherData = new MatcherReleventData();
Tuple2<Long, MatcherReleventData> tuple = new Tuple2<Long,
MatcherReleventData>(t._2,
matcherData.convertVendorDataToMatcherData(t._1));
 return tuple;
}

}).cache();

On 13 April 2015 at 03:11, Ted Yu yuzhih...@gmail.com wrote:

 Please also take a look at ZippedWithIndexRDDPartition which is 72 lines
 long.

 You can create your own version which extends RDD[(Long, T)]

 Cheers

 On Sun, Apr 12, 2015 at 1:29 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. will return something like JavaPairRDD<Object, Long>

 The long component of the pair fits your description of index. What other
 requirement does ZipWithIndex not provide you ?

 Cheers

 On Sun, Apr 12, 2015 at 1:16 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi All I have an RDD JavaRDDObject and I want to convert it to
 JavaPairRDDIndex,Object.. Index should be unique and it should maintain
 the order. For first object It should have 1 and then for second 2 like
 that.

 I tried using ZipWithIndex but it will return something like
 JavaPairRDDObject, long
 I wanted to use this RDD for lookup and join operation later in my
 workflow so ordering is important.


 Regards
 jeet






Help in transforming the RDD

2015-04-13 Thread Jeetendra Gangele
Hi All, I have a JavaPairRDD<Long, String> where each long key has 4
string values associated with it. I want to fire an HBase query to look
up each String part of the RDD.
This look-up will return around 7K integers, so for each key I will
have 7K values. My input RDD is already more than a GB, and after
getting these results it will become around 50 GB, which I want to avoid.

My problem:
1, Test1
1, test2
1, test3
1, test4
...

Now I will query HBase for Test1, test2, test3, test4 in parallel; each query
will return around 2K results, so about 8K integers in total.

Now for each record I will have 1*8000 entries in my RDD, and suppose I have
1 million records; it will become 1 million*8000, which is huge to process even
using groupBy.


Re: function to convert to pair

2015-04-12 Thread Jeetendra Gangele
I have to create some kind of index from my JavaRDD<Object>; it should be
something like JavaPairRDD<uniqueIndex, Object>.

But zipWithIndex gives (Object, Long); later I need to use this RDD for a
join, so it looks like it won't work for me.
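If the only problem is the order inside the pair, a small sketch of swapping it
right after zipWithIndex (SomeObject stands in for the element type):

JavaPairRDD<Long, SomeObject> indexed = rdd.zipWithIndex().mapToPair(
    new PairFunction<Tuple2<SomeObject, Long>, Long, SomeObject>() {
      @Override
      public Tuple2<Long, SomeObject> call(Tuple2<SomeObject, Long> t) {
        // put the generated index first so it can be used as the join key
        return new Tuple2<Long, SomeObject>(t._2(), t._1());
      }
    });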

On 9 April 2015 at 04:17, Ted Yu yuzhih...@gmail.com wrote:

 Please take a look at zipWithIndex() of RDD.

 Cheers

 On Wed, Apr 8, 2015 at 3:40 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi All I have a RDDSomeObject I want to convert it to
 RDDsequenceNumber,SomeObject this sequence number can be 1 for first
 SomeObject 2 for second SomeOjejct


 Regards
 jeet





regarding ZipWithIndex

2015-04-12 Thread Jeetendra Gangele
Hi All, I have an RDD JavaRDD<Object> and I want to convert it to
JavaPairRDD<Index, Object>. The index should be unique and it should maintain
the order: for the first object it should have 1, and then for the second 2,
and so on.

I tried using zipWithIndex but it returns something like
JavaPairRDD<Object, Long>.
I want to use this RDD for lookup and join operations later in my workflow,
so ordering is important.


Regards
jeet


Tasks going into NODE_LOCAL at beginning of job

2015-04-11 Thread Jeetendra Gangele
I have 3 transformations and then I am running foreach. The job's processing
is going to the NODE_LOCAL locality level, no executor is doing anything, it
waits for a long time,
and no task is running.

Regards,
Jeetendra


foreach going in infinite loop

2015-04-10 Thread Jeetendra Gangele
Hi All, I am running the code below. Before calling foreach I did 3
transformations using mapToPair. In my application there are 16 executors but
no executor is running anything.

rddWithscore.foreach(new
VoidFunction<Tuple2<VendorRecord, Map<Integer, Double>>>() {

@Override
public void call(Tuple2<VendorRecord, Map<Integer, Double>> t)
throws Exception {
 Entry<Integer, Double> maxEntry = null;

for (Entry<Integer, Double> entry : t._2.entrySet()) {
if (maxEntry == null || entry.getValue() > maxEntry.getValue()) {
maxEntry = entry;
   // updateVendorData(maxEntry.getKey());

}
log.info("for vendor :" + t._1.getVendorId() + " matched company
is" + maxEntry.getKey());
}
}


});


Need subscription process

2015-04-08 Thread Jeetendra Gangele
Hi All, how can I subscribe myself to this group so that every mail sent to
this group comes to me as well?
I already sent a request to user-subscr...@spark.apache.org, but I am still not
getting the mail sent to this group by other people.


Regards
Jeetendra


Regarding GroupBy

2015-04-08 Thread Jeetendra Gangele
I wanted to run groupBy(partitions) but this is not working.
Here the first part in pairvendorData will be repeated for multiple second
parts. Both are objects; do I need to override equals and hashCode?
Is groupBy fast enough?

JavaPairRDD<VendorRecord, VendorRecord> pairvendorData
= matchRdd.flatMapToPair(new PairFlatMapFunction<VendorRecord,
VendorRecord, VendorRecord>() {

@Override
public Iterable<Tuple2<VendorRecord, VendorRecord>> call(
VendorRecord t) throws Exception {
List<Tuple2<VendorRecord, VendorRecord>> pairs = new
LinkedList<Tuple2<VendorRecord, VendorRecord>>();
CompanyMatcherHelper helper = new CompanyMatcherHelper();
 MatcherKeys matchkeys = helper.getBlockinkeys(t);
List<VendorRecord> Matchedrecords = ckdao.getMatchingRecordCknids(matchkeys);
log.info("List Size is" + Matchedrecords.size());
for (int i = 0; i < Matchedrecords.size(); i++) {
pairs.add(new Tuple2<VendorRecord, VendorRecord>(t, Matchedrecords.get(i)));
}
 return pairs;
}
 }
);
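On the equals/hashCode question: groupBy and groupByKey compare keys with
equals and hashCode, so a VendorRecord key needs consistent, field-based
implementations of both. A rough sketch, assuming the company id is what
identifies a record (a placeholder assumption; use the real identity fields):

@Override
public boolean equals(Object o) {
  if (this == o) return true;
  if (!(o instanceof VendorRecord)) return false;
  VendorRecord other = (VendorRecord) o;
  // hypothetical: records with the same company id are treated as the same key
  return java.util.Objects.equals(getCompanyId(), other.getCompanyId());
}

@Override
public int hashCode() {
  return java.util.Objects.hashCode(getCompanyId());
}

The implementations must also be stable across serialization, since keys are
compared after being shipped to other executors.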


function to convert to pair

2015-04-08 Thread Jeetendra Gangele
Hi All, I have an RDD<SomeObject> and I want to convert it to
RDD<sequenceNumber, SomeObject>; this sequence number can be 1 for the first
SomeObject, 2 for the second SomeObject, and so on.


Regards
jeet


Re: task not serialize

2015-04-07 Thread Jeetendra Gangele
Let's say I follow the below approach and I get a pair RDD with a huge size,
which cannot fit into one machine. What happens when I run foreach on this RDD?

On 7 April 2015 at 04:25, Jeetendra Gangele gangele...@gmail.com wrote:



 On 7 April 2015 at 04:03, Dean Wampler deanwamp...@gmail.com wrote:


 On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Thanks a lot.That means Spark does not support the nested RDD?
 if I pass the javaSparkContext that also wont work. I mean passing
 SparkContext not possible since its not serializable

 That's right. RDD don't nest and SparkContexts aren't serializable.


 i have a requirement where I will get JavaRDDVendorRecord matchRdd
 and I need to return the postential matches for this record from Hbase. so
 for each field of VendorRecord I have to do following

 1. query Hbase to get the list of potential record in RDD
 2. run logistic regression on RDD return from steps 1 and each element
 of the passed matchRdd.

 If I understand you correctly, each VectorRecord could correspond to
 0-to-N records in HBase, which you need to fetch, true?

  yes thats correct each Vendorrecord corresponds to 0 to N matches


 If so, you could use the RDD flatMap method, which takes a function a
 that accepts each record, then returns a sequence of 0-to-N new records of
 some other type, like your HBase records. However, running an HBase query
 for each VendorRecord could be expensive. If you can turn this into a range
 query or something like that, it would help. I haven't used HBase much, so
 I don't have good advice on optimizing this, if necessary.

 Alternatively, can you do some sort of join on the VendorRecord RDD and
 an RDD of query results from HBase?

  Join will give too big result RDD of query result is returning around
 1 for each record and i have 2 millions to process so it will be huge
 to have this. 2 m*1 big number


 For #2, it sounds like you need flatMap to return records that combine
 the input VendorRecords and fields pulled from HBase.

 Whatever you can do to make this work like table scans and joins will
 probably be most efficient.

 dean




 On 7 April 2015 at 03:33, Dean Wampler deanwamp...@gmail.com wrote:

 The log instance won't be serializable, because it will have a file
 handle to write to. Try defining another static method outside
 matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper
 might not be serializable either, but you didn't provide it. If it holds a
 database connection, same problem.

 You can't suppress the warning because it's actually an error. The
 VoidFunction can't be serialized to send it over the cluster's network.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Mon, Apr 6, 2015 at 4:30 PM, Jeetendra Gangele gangele...@gmail.com
  wrote:

 In this code in foreach I am getting task not serialized exception


 @SuppressWarnings(serial)
 public static  void  matchAndMerge(JavaRDDVendorRecord matchRdd,
  final JavaSparkContext jsc) throws IOException{
 log.info(Company matcher started);
 //final JavaSparkContext jsc = getSparkContext();
   matchRdd.foreachAsync(new VoidFunctionVendorRecord(){
 @Override
 public void call(VendorRecord t) throws Exception {
  if(t !=null){
 try{
 CompanyMatcherHelper.UpdateMatchedRecord(jsc,t);
  } catch (Exception e) {
 log.error(ERROR while running Matcher for company  +
 t.getCompanyId(), e);
 }
 }
  }
 });

  }
















Re: task not serialize

2015-04-07 Thread Jeetendra Gangele
I am thinking of following the below approach (in my class HBase also returns
the same object which I will get in the RDD).
1. First run the flatMapToPair:

JavaPairRDD<VendorRecord, Iterable<VendorRecord>> pairvendorData
= matchRdd.flatMapToPair(new PairFlatMapFunction<VendorRecord,
VendorRecord, VendorRecord>() {

@Override
public Iterable<Tuple2<VendorRecord, VendorRecord>> call(
VendorRecord t) throws Exception {
List<Tuple2<VendorRecord, VendorRecord>> pairs = new
LinkedList<Tuple2<VendorRecord, VendorRecord>>();
MatcherKeys matchkeys = CompanyMatcherHelper.getBlockinkeys(t);
List<VendorRecord> Matchedrecords
= ckdao.getMatchingRecordsWithscan(matchkeys);
for (int i = 0; i < Matchedrecords.size(); i++) {
pairs.add(new Tuple2<VendorRecord, VendorRecord>(t, Matchedrecords.get(i)));
}
 return pairs;
}
 }
).groupByKey(200);
Question: will it store the returned RDD on one node, or does it only
materialize when I run the second step? In groupBy, if I increase the
partition number will it increase the performance?

 2. Then apply mapPartitions on this RDD and do the logistic regression here.
My issue is that my logistic regression function take






On 7 April 2015 at 18:38, Dean Wampler deanwamp...@gmail.com wrote:

 Foreach() runs in parallel across the cluster, like map, flatMap, etc.
 You'll only run into problems if you call collect(), which brings the
 entire RDD into memory in the driver program.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Tue, Apr 7, 2015 at 3:50 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Lets say I follow below approach and I got RddPair with huge size ..
 which can not fit into one machine ... what to run foreach on this RDD?

 On 7 April 2015 at 04:25, Jeetendra Gangele gangele...@gmail.com wrote:



 On 7 April 2015 at 04:03, Dean Wampler deanwamp...@gmail.com wrote:


 On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele gangele...@gmail.com
  wrote:

 Thanks a lot.That means Spark does not support the nested RDD?
 if I pass the javaSparkContext that also wont work. I mean passing
 SparkContext not possible since its not serializable

 That's right. RDD don't nest and SparkContexts aren't serializable.


 i have a requirement where I will get JavaRDDVendorRecord matchRdd
 and I need to return the postential matches for this record from Hbase. so
 for each field of VendorRecord I have to do following

 1. query Hbase to get the list of potential record in RDD
 2. run logistic regression on RDD return from steps 1 and each element
 of the passed matchRdd.

 If I understand you correctly, each VectorRecord could correspond to
 0-to-N records in HBase, which you need to fetch, true?

  yes thats correct each Vendorrecord corresponds to 0 to N matches


 If so, you could use the RDD flatMap method, which takes a function a
 that accepts each record, then returns a sequence of 0-to-N new records of
 some other type, like your HBase records. However, running an HBase query
 for each VendorRecord could be expensive. If you can turn this into a range
 query or something like that, it would help. I haven't used HBase much, so
 I don't have good advice on optimizing this, if necessary.

 Alternatively, can you do some sort of join on the VendorRecord RDD and
 an RDD of query results from HBase?

  Join will give too big result RDD of query result is returning around
 1 for each record and i have 2 millions to process so it will be huge
 to have this. 2 m*1 big number


 For #2, it sounds like you need flatMap to return records that combine
 the input VendorRecords and fields pulled from HBase.

 Whatever you can do to make this work like table scans and joins will
 probably be most efficient.

 dean




 On 7 April 2015 at 03:33, Dean Wampler deanwamp...@gmail.com wrote:

 The log instance won't be serializable, because it will have a file
 handle to write to. Try defining another static method outside
 matchAndMerge that encapsulates the call to log.error. 
 CompanyMatcherHelper
 might not be serializable either, but you didn't provide it. If it holds 
 a
 database connection, same problem.

 You can't suppress the warning because it's actually an error. The
 VoidFunction can't be serialized to send it over the cluster's network.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Mon, Apr 6, 2015 at 4:30 PM, Jeetendra Gangele 
 gangele...@gmail.com wrote:

 In this code in foreach I am getting task not serialized exception


 @SuppressWarnings(serial)
 public static  void  matchAndMerge(JavaRDDVendorRecord matchRdd,
  final JavaSparkContext jsc) throws IOException{
 log.info(Company matcher started);
 //final JavaSparkContext jsc = getSparkContext

FlatMapPair run for longer time

2015-04-07 Thread Jeetendra Gangele
Hi All, I am running the below code and it runs for a very long time. The input
to flatMapToPair is 50K records, and I am calling HBase 50K times with just a
range scan query, which should not take much time. Can anybody guide me on what
is wrong here?

JavaPairRDD<VendorRecord, Iterable<VendorRecord>> pairvendorData
= matchRdd.flatMapToPair(new PairFlatMapFunction<VendorRecord,
VendorRecord, VendorRecord>() {

@Override
public Iterable<Tuple2<VendorRecord, VendorRecord>> call(
VendorRecord t) throws Exception {
List<Tuple2<VendorRecord, VendorRecord>> pairs = new
LinkedList<Tuple2<VendorRecord, VendorRecord>>();
MatcherKeys matchkeys = CompanyMatcherHelper.getBlockinkeys(t);
List<VendorRecord> Matchedrecords
= ckdao.getMatchingRecordsWithscan(matchkeys);
for (int i = 0; i < Matchedrecords.size(); i++) {
pairs.add(new Tuple2<VendorRecord, VendorRecord>(t, Matchedrecords.get(i)));
}
 return pairs;
}
 }
).groupByKey(200).persist(StorageLevel.DISK_ONLY_2());
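One common way to cut down the 50K individual HBase calls is to do the lookups
once per partition, so any helper or connection state can be reused across
records. A rough sketch with the same classes as the snippet above (on Spark
1.2/1.3 the pair flat-map function returns an Iterable; whether the DAO
actually reuses a connection depends on its implementation):

JavaPairRDD<VendorRecord, Iterable<VendorRecord>> pairvendorData =
  matchRdd.mapPartitionsToPair(
    new PairFlatMapFunction<Iterator<VendorRecord>, VendorRecord, VendorRecord>() {
      @Override
      public Iterable<Tuple2<VendorRecord, VendorRecord>> call(
          Iterator<VendorRecord> records) throws Exception {
        // one pass over the whole partition; state created here is reused per record
        List<Tuple2<VendorRecord, VendorRecord>> out =
            new LinkedList<Tuple2<VendorRecord, VendorRecord>>();
        while (records.hasNext()) {
          VendorRecord t = records.next();
          MatcherKeys matchkeys = CompanyMatcherHelper.getBlockinkeys(t);
          for (VendorRecord m : ckdao.getMatchingRecordsWithscan(matchkeys)) {
            out.add(new Tuple2<VendorRecord, VendorRecord>(t, m));
          }
        }
        return out;
      }
    }).groupByKey(200).persist(StorageLevel.DISK_ONLY_2());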


task not serialize

2015-04-06 Thread Jeetendra Gangele
In this code, in the foreach, I am getting a "task not serializable" exception.


@SuppressWarnings("serial")
public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd, final
JavaSparkContext jsc) throws IOException {
log.info("Company matcher started");
//final JavaSparkContext jsc = getSparkContext();
  matchRdd.foreachAsync(new VoidFunction<VendorRecord>() {
@Override
public void call(VendorRecord t) throws Exception {
 if (t != null) {
try {
CompanyMatcherHelper.UpdateMatchedRecord(jsc, t);
 } catch (Exception e) {
log.error("ERROR while running Matcher for company " + t.getCompanyId(), e);
}
}
 }
});

 }


Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-04-06 Thread Jeetendra Gangele
I hit the same issue again. This time I tried to return the object and it
failed with "task not serializable"; below is the code.
Here VendorRecord is serializable.

private static JavaRDD<VendorRecord>
getVendorDataToProcess(JavaSparkContext sc) throws IOException {
 return sc
.newAPIHadoopRDD(getVendorDataRowKeyScannerConfiguration(),
TableInputFormat.class,
ImmutableBytesWritable.class, Result.class)
.map(new Function<Tuple2<ImmutableBytesWritable, Result>,
VendorRecord>() {
@Override
public VendorRecord call(Tuple2<ImmutableBytesWritable, Result> v1)
throws Exception {
String rowKey = new String(v1._1.get());
 VendorRecord vd = vendorDataDAO.getVendorDataForRowkey(rowKey);
 return vd;
}
});
 }
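If the non-serializable part is the DAO field captured by the function, one
common workaround is to create the DAO inside call() (or once per partition
with mapPartitions) instead of capturing it from the driver. A rough sketch,
assuming a hypothetical VendorDataDAO that can be constructed on the executors:

.map(new Function<Tuple2<ImmutableBytesWritable, Result>, VendorRecord>() {
  @Override
  public VendorRecord call(Tuple2<ImmutableBytesWritable, Result> v1) throws Exception {
    String rowKey = new String(v1._1.get());
    // create the DAO on the executor instead of serializing it from the driver
    VendorDataDAO dao = new VendorDataDAO();
    return dao.getVendorDataForRowkey(rowKey);
  }
});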


On 1 April 2015 at 02:07, Ted Yu yuzhih...@gmail.com wrote:

 Jeetendra:
 Please extract the information you need from Result and return the
 extracted portion - instead of returning Result itself.

 Cheers

 On Tue, Mar 31, 2015 at 1:14 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 The example in
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
  might
 help

 Best,

 --
 Nan Zhu
 http://codingcat.me

 On Tuesday, March 31, 2015 at 3:56 PM, Sean Owen wrote:

 Yep, it's not serializable:

 https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html

 You can't return this from a distributed operation since that would
 mean it has to travel over the network and you haven't supplied any
 way to convert the thing into bytes.

 On Tue, Mar 31, 2015 at 8:51 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 When I am trying to get the result from Hbase and running mapToPair
 function
 of RRD its giving the error
 java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

 Here is the code

 // private static JavaPairRDDInteger, Result
 getCompanyDataRDD(JavaSparkContext sc) throws IOException {
 // return sc.newAPIHadoopRDD(companyDAO.getCompnayDataConfiguration(),
 TableInputFormat.class, ImmutableBytesWritable.class,
 // Result.class).mapToPair(new
 PairFunctionTuple2ImmutableBytesWritable, Result, Integer, Result() {
 //
 // public Tuple2Integer, Result call(Tuple2ImmutableBytesWritable,
 Result t) throws Exception {
 // System.out.println(In getCompanyDataRDD+t._2);
 //
 // String cknid = Bytes.toString(t._1.get());
 // System.out.println(processing cknids is:+cknid);
 // Integer cknidInt = Integer.parseInt(cknid);
 // Tuple2Integer, Result returnTuple = new Tuple2Integer,
 Result(cknidInt, t._2);
 // return returnTuple;
 // }
 // });
 // }








Re: task not serialize

2015-04-06 Thread Jeetendra Gangele
Thanks a lot. That means Spark does not support nested RDDs?
If I pass the JavaSparkContext that also won't work; I mean passing the
SparkContext is not possible since it's not serializable.

I have a requirement where I will get a JavaRDD<VendorRecord> matchRdd and I
need to return the potential matches for this record from HBase. So for
each field of VendorRecord I have to do the following:

1. Query HBase to get the list of potential records in an RDD.
2. Run logistic regression on the RDD returned from step 1 and each element of
the passed matchRdd.




On 7 April 2015 at 03:33, Dean Wampler deanwamp...@gmail.com wrote:

 The log instance won't be serializable, because it will have a file
 handle to write to. Try defining another static method outside
 matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper
 might not be serializable either, but you didn't provide it. If it holds a
 database connection, same problem.

 You can't suppress the warning because it's actually an error. The
 VoidFunction can't be serialized to send it over the cluster's network.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Mon, Apr 6, 2015 at 4:30 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 In this code in foreach I am getting task not serialized exception


 @SuppressWarnings(serial)
 public static  void  matchAndMerge(JavaRDDVendorRecord matchRdd,  final
 JavaSparkContext jsc) throws IOException{
 log.info(Company matcher started);
 //final JavaSparkContext jsc = getSparkContext();
   matchRdd.foreachAsync(new VoidFunctionVendorRecord(){
 @Override
 public void call(VendorRecord t) throws Exception {
  if(t !=null){
 try{
 CompanyMatcherHelper.UpdateMatchedRecord(jsc,t);
  } catch (Exception e) {
 log.error(ERROR while running Matcher for company  + t.getCompanyId(),
 e);
 }
 }
  }
 });

  }





Re: task not serialize

2015-04-06 Thread Jeetendra Gangele
On 7 April 2015 at 04:03, Dean Wampler deanwamp...@gmail.com wrote:


 On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Thanks a lot.That means Spark does not support the nested RDD?
 if I pass the javaSparkContext that also wont work. I mean passing
 SparkContext not possible since its not serializable

 That's right. RDD don't nest and SparkContexts aren't serializable.
 


 i have a requirement where I will get JavaRDDVendorRecord matchRdd and
 I need to return the postential matches for this record from Hbase. so for
 each field of VendorRecord I have to do following

 1. query Hbase to get the list of potential record in RDD
 2. run logistic regression on RDD return from steps 1 and each element of
 the passed matchRdd.

 If I understand you correctly, each VectorRecord could correspond to
 0-to-N records in HBase, which you need to fetch, true?

 Yes, that's correct; each VendorRecord corresponds to 0 to N matches.


 If so, you could use the RDD flatMap method, which takes a function a
 that accepts each record, then returns a sequence of 0-to-N new records of
 some other type, like your HBase records. However, running an HBase query
 for each VendorRecord could be expensive. If you can turn this into a range
 query or something like that, it would help. I haven't used HBase much, so
 I don't have good advice on optimizing this, if necessary.

 Alternatively, can you do some sort of join on the VendorRecord RDD and an
 RDD of query results from HBase?

 Join will give too big result RDD of query result is returning around
1 for each record and i have 2 millions to process so it will be huge
to have this. 2 m*1 big number


 For #2, it sounds like you need flatMap to return records that combine the
 input VendorRecords and fields pulled from HBase.

 Whatever you can do to make this work like table scans and joins will
 probably be most efficient.

 dean




 On 7 April 2015 at 03:33, Dean Wampler deanwamp...@gmail.com wrote:

 The log instance won't be serializable, because it will have a file
 handle to write to. Try defining another static method outside
 matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper
 might not be serializable either, but you didn't provide it. If it holds a
 database connection, same problem.

 You can't suppress the warning because it's actually an error. The
 VoidFunction can't be serialized to send it over the cluster's network.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Mon, Apr 6, 2015 at 4:30 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 In this code in foreach I am getting task not serialized exception


 @SuppressWarnings(serial)
 public static  void  matchAndMerge(JavaRDDVendorRecord matchRdd,
  final JavaSparkContext jsc) throws IOException{
 log.info(Company matcher started);
 //final JavaSparkContext jsc = getSparkContext();
   matchRdd.foreachAsync(new VoidFunctionVendorRecord(){
 @Override
 public void call(VendorRecord t) throws Exception {
  if(t !=null){
 try{
 CompanyMatcherHelper.UpdateMatchedRecord(jsc,t);
  } catch (Exception e) {
 log.error(ERROR while running Matcher for company  +
 t.getCompanyId(), e);
 }
 }
  }
 });

  }










Re: newAPIHadoopRDD Multiple scan result return from Hbase

2015-04-05 Thread Jeetendra Gangele
I am already using STARTROW and ENDROW in HBase from newAPIHadoopRDD.
Can I do something similar with an RDD? Let's say use a filter on the RDD to
get only those records which match the same criteria mentioned in STARTROW and
STOPROW. Will it be much faster than querying HBase?

On 6 April 2015 at 03:15, Ted Yu yuzhih...@gmail.com wrote:

 bq. HBase scan operation like scan StartROW and EndROW in RDD?

 I don't think RDD supports concept of start row and end row.

 In HBase, please take a look at the following methods of Scan:

   public Scan setStartRow(byte [] startRow) {

   public Scan setStopRow(byte [] stopRow) {

 Cheers

 On Sun, Apr 5, 2015 at 2:35 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 I have a 2GB HBase table where the data is stored in the form of key and
 value (only one column per key) and the keys are also unique.

 What I am thinking is to load the complete HBase table into an RDD and then do
 operations like scans in the RDD rather than in HBase.
 Can I do an HBase scan operation like STARTROW and ENDROW in an RDD?

 The first step in my job will be to load the complete data into the RDD.



 On 6 April 2015 at 02:45, Ted Yu yuzhih...@gmail.com wrote:

 You do need to apply the patch since 0.96 doesn't have this feature.

 For JavaSparkContext.newAPIHadoopRDD, can you check region server
 metrics to see where the overhead might be (compared to creating scan
 and firing query using native client) ?

 Thanks

 On Sun, Apr 5, 2015 at 2:00 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 That's true, I checked the MultiRowRangeFilter and it serves my need.
 Do I need to apply the patch for this, since I am using HBase version
 0.96?

 Also I have noticed that when I use JavaSparkContext.newAPIHadoopRDD it is
 slow compared to creating a scan and firing the query; is there any reason?




 On 6 April 2015 at 01:57, Ted Yu yuzhih...@gmail.com wrote:

 Looks like MultiRowRangeFilter would serve your need.

 See HBASE-11144.

 HBase 1.1 would be released in May.

 You can also backport it to the HBase release you're using.

 On Sat, Apr 4, 2015 at 8:45 AM, Jeetendra Gangele 
 gangele...@gmail.com wrote:

 Here is my conf object passing first parameter of API.
 but here I want to pass multiple scan means i have 4 criteria for
 STRAT ROW and STOROW in same table.
 by using below code i can get result for one STARTROW and ENDROW.

 Configuration conf = DBConfiguration.getConf();

 // int scannerTimeout = (int) conf.getLong(
 //  HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, -1);
 // System.out.println(lease timeout on server is+scannerTimeout);

 int scannerTimeout = (int) conf.getLong(
 hbase.client.scanner.timeout.period, -1);
 // conf.setLong(hbase.client.scanner.timeout.period, 6L);
 conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME);
 Scan scan = new Scan();
 scan.addFamily(FAMILY);
 FilterList filterList = new FilterList(Operator.MUST_PASS_ALL);
 filterList.addFilter(new KeyOnlyFilter());
  filterList.addFilter(new FirstKeyOnlyFilter());
 scan.setFilter(filterList);

 scan.setCacheBlocks(false);
 scan.setCaching(10);
  scan.setBatch(1000);
 scan.setSmall(false);
  conf.set(TableInputFormat.SCAN,
 DatabaseUtils.convertScanToString(scan));
 return conf;

 On 4 April 2015 at 20:54, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi All,

 Can we get the result of the multiple scan
 from JavaSparkContext.newAPIHadoopRDD from Hbase.

 This method first parameter take configuration object where I have
 added filter. but how Can I query multiple scan from same table calling
 this API only once?

 regards
 jeetendra















Re: newAPIHadoopRDD Multiple scan result return from Hbase

2015-04-05 Thread Jeetendra Gangele
Sure I will check.

On 6 April 2015 at 02:45, Ted Yu yuzhih...@gmail.com wrote:

 You do need to apply the patch since 0.96 doesn't have this feature.

 For JavaSparkContext.newAPIHadoopRDD, can you check region server metrics
 to see where the overhead might be (compared to creating scan and firing
 query using native client) ?

 Thanks

 On Sun, Apr 5, 2015 at 2:00 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Thats true I checked the MultiRowRangeFilter  and its serving my need.
 do I need to apply the patch? for this since I am using 0.96 hbase
 version.

 Also I have checked when I used JavaSparkContext.newAPIHadoopRDD its
 slow compare to creating scan and firing query, is there any reason?




 On 6 April 2015 at 01:57, Ted Yu yuzhih...@gmail.com wrote:

 Looks like MultiRowRangeFilter would serve your need.

 See HBASE-11144.

 HBase 1.1 would be released in May.

 You can also backport it to the HBase release you're using.

 On Sat, Apr 4, 2015 at 8:45 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Here is my conf object passing first parameter of API.
 but here I want to pass multiple scan means i have 4 criteria for STRAT
 ROW and STOROW in same table.
 by using below code i can get result for one STARTROW and ENDROW.

 Configuration conf = DBConfiguration.getConf();

 // int scannerTimeout = (int) conf.getLong(
 //  HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, -1);
 // System.out.println(lease timeout on server is+scannerTimeout);

 int scannerTimeout = (int) conf.getLong(
 hbase.client.scanner.timeout.period, -1);
 // conf.setLong(hbase.client.scanner.timeout.period, 6L);
 conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME);
 Scan scan = new Scan();
 scan.addFamily(FAMILY);
 FilterList filterList = new FilterList(Operator.MUST_PASS_ALL);
 filterList.addFilter(new KeyOnlyFilter());
  filterList.addFilter(new FirstKeyOnlyFilter());
 scan.setFilter(filterList);

 scan.setCacheBlocks(false);
 scan.setCaching(10);
  scan.setBatch(1000);
 scan.setSmall(false);
  conf.set(TableInputFormat.SCAN,
 DatabaseUtils.convertScanToString(scan));
 return conf;

 On 4 April 2015 at 20:54, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi All,

 Can we get the result of the multiple scan
 from JavaSparkContext.newAPIHadoopRDD from Hbase.

 This method first parameter take configuration object where I have
 added filter. but how Can I query multiple scan from same table calling
 this API only once?

 regards
 jeetendra










Diff between foreach and foreachAsync

2015-04-05 Thread Jeetendra Gangele
Hi,
can somebody explain to me what the difference is between foreach and
foreachAsync over an RDD action? Which one will give the better result /
maximum throughput?
Does foreach run in a parallel way?
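For reference, both actions run the supplied function on the executors in
parallel across partitions; the difference is that foreach blocks the driver
until the action finishes, while foreachAsync returns a JavaFutureAction
immediately. A minimal sketch, assuming a JavaRDD<String> named lines:

JavaFutureAction<Void> pending = lines.foreachAsync(new VoidFunction<String>() {
  @Override
  public void call(String s) throws Exception {
    System.out.println(s);   // runs on the executors
  }
});
// the driver is free to do other work here
pending.get();               // block only when completion actually matters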


Re: conversion from java collection type to scala JavaRDD<Object>

2015-04-04 Thread Jeetendra Gangele
Hi, I have tried with parallelize but I got the below exception:

java.io.NotSerializableException: pacific.dr.VendorRecord

Here is my code:

List<VendorRecord>
vendorRecords = blockingKeys.getMatchingRecordsWithscan(matchKeysOutput);
JavaRDD<VendorRecord> lines = sc.parallelize(vendorRecords);
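The NotSerializableException above is about the element type itself:
parallelize ships the list's elements from the driver, so they must be
serializable. A minimal sketch of the usual fix (assuming Java serialization is
acceptable for this class; with Kryo the class would be registered instead):

public class VendorRecord implements java.io.Serializable {
  private static final long serialVersionUID = 1L;
  // existing fields, getters and setters unchanged
}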


On 2 April 2015 at 21:11, Dean Wampler deanwamp...@gmail.com wrote:

 Use JavaSparkContext.parallelize.


 http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List)

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 11:33 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:

  Hi All,
  Is there a way to make a JavaRDD<Object> from an existing Java collection of
  type List<Object>?
  I know this can be done using Scala, but I am looking at how to do this
  using Java.


 Regards
 Jeetendra





Re: newAPIHadoopRDD Multiple scan result return from Hbase

2015-04-04 Thread Jeetendra Gangele
Here is my conf object, passed as the first parameter of the API.
But here I want to pass multiple scans, meaning I have 4 criteria for STARTROW
and STOPROW in the same table.
Using the code below I can get the result for one STARTROW and ENDROW.

Configuration conf = DBConfiguration.getConf();

// int scannerTimeout = (int) conf.getLong(
//  HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, -1);
// System.out.println("lease timeout on server is" + scannerTimeout);

int scannerTimeout = (int) conf.getLong(
"hbase.client.scanner.timeout.period", -1);
// conf.setLong("hbase.client.scanner.timeout.period", 6L);
conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME);
Scan scan = new Scan();
scan.addFamily(FAMILY);
FilterList filterList = new FilterList(Operator.MUST_PASS_ALL);
filterList.addFilter(new KeyOnlyFilter());
 filterList.addFilter(new FirstKeyOnlyFilter());
scan.setFilter(filterList);

scan.setCacheBlocks(false);
scan.setCaching(10);
 scan.setBatch(1000);
scan.setSmall(false);
 conf.set(TableInputFormat.SCAN, DatabaseUtils.convertScanToString(scan));
return conf;
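For the four start/stop criteria, the MultiRowRangeFilter suggested later in
this thread (HBASE-11144, shipped in HBase 1.1 or available as a backport) can
carry several ranges in a single scan. A rough sketch with placeholder row
keys:

List<MultiRowRangeFilter.RowRange> ranges = new ArrayList<MultiRowRangeFilter.RowRange>();
ranges.add(new MultiRowRangeFilter.RowRange(Bytes.toBytes("start1"), true, Bytes.toBytes("stop1"), false));
ranges.add(new MultiRowRangeFilter.RowRange(Bytes.toBytes("start2"), true, Bytes.toBytes("stop2"), false));
// ... one RowRange per criterion, up to the 4 needed here
Scan multiScan = new Scan();
multiScan.addFamily(FAMILY);
multiScan.setFilter(new MultiRowRangeFilter(ranges));
conf.set(TableInputFormat.SCAN, DatabaseUtils.convertScanToString(multiScan));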

On 4 April 2015 at 20:54, Jeetendra Gangele gangele...@gmail.com wrote:

 Hi All,

 Can we get the result of the multiple scan
 from JavaSparkContext.newAPIHadoopRDD from Hbase.

 This method first parameter take configuration object where I have added
 filter. but how Can I query multiple scan from same table calling this API
 only once?

 regards
 jeetendra



newAPIHadoopRDD Multiple scan result return from Hbase

2015-04-04 Thread Jeetendra Gangele
Hi All,

Can we get the result of multiple scans
from JavaSparkContext.newAPIHadoopRDD from HBase?

This method's first parameter takes a configuration object where I have added
a filter, but how can I query multiple scans on the same table calling this API
only once?

regards
jeetendra


Regarding MLLIB sparse and dense matrix

2015-04-03 Thread Jeetendra Gangele
Hi All
I am building a logistic regression for matching person data. Let's say
two person objects are given with their attributes and we need to find the
score; that means on one side you have 10 million records and on the other
side we have 1 record, and we need to tell which one matches with the highest
score among the 1 million.

I am storing the scores of the similarity algorithms in a dense matrix and
considering these as features. I will apply many similarity algorithms on one
attribute.

Should I use sparse or dense? What happens in dense when a score is null or
when some of the attributes are missing?

Is there any support for regularized logistic regression? Currently I am
using LogisticRegressionWithSGD.

Regards
jeetendra
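On the regularization question: LogisticRegressionWithSGD exposes its optimizer,
where the regularization parameter and updater can be set. A minimal sketch
(parameter values are placeholders, and trainingData is assumed to be a
JavaRDD<LabeledPoint>):

LogisticRegressionWithSGD algo = new LogisticRegressionWithSGD();
algo.optimizer()
    .setNumIterations(100)
    .setStepSize(1.0)
    .setRegParam(0.01)
    .setUpdater(new SquaredL2Updater());   // or new L1Updater() for L1 regularization
LogisticRegressionModel model = algo.run(trainingData.rdd());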

